泰晓科技 (tinylab.org): Focusing on Linux, tracing things to their source and seeing the big picture in the details!
Website: https://tinylab.org

RISC-V Linux Kernel and Related Technology News, Issue 94

Created by 呀呀呀 on 2024/06/02

Date: 20240602
Editor: 晓瑜
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support

v6: RISC-V: ACPI: Add external interrupt controller support

This series adds support for the below ECR approved by ASWG. The series primarily enables irqchip drivers for RISC-V ACPI based platforms.

v0: RISC-V: Use Zkr to seed KASLR base address

Detect the Zkr extension and use it to seed the kernel base address.

v1: RISC-V: Implement ioremap_wc/wt

To improve performance, map the memory as weakly-ordered non-cacheable normal memory.

v1: riscv: stacktrace: Add USER_STACKTRACE support

Use the perf_callchain_user() code as a blueprint to implement arch_stack_walk_user(), which adds user stacktrace support on riscv.

v1: external ulpi vbus control

A customer sent me a patch adding a dt property to enable external vbus control as their phy didn’t support it. I was surprised to see that none of the other musb drivers made any use of this, but there is handling in the musb core for it, which made me feel like I was missing something as to why it was not used by other drivers.

v1: Revert “riscv: mm: accelerate pagefault when badaccess”

I accidentally picked up an earlier version of this patch, which had already landed via mm. The patch I picked up contains a bug, which I kept as I thought it was a fix. So let’s just revert it.

v1: riscv: sophgo: add thermal sensor support for cv180x/sg200x SoCs

This series implements driver for Sophgo cv180x/sg200x on-chip thermal sensor and adds common thermal zones for these SoCs.

v1: riscv: dts: thead: th1520: Add PMU event node

T-HEAD th1520 uses the standard C910 chip, and its PMU is already supported by OpenSBI.

v1: irqchip/sifive-plic: Chain to parent IRQ after handlers are ready

Now that the PLIC uses a platform driver, the driver is probed later in the boot process, where interrupts from peripherals might already be pending.

v1: riscv: perf: Add support for Control Transfer Records Ext.

This series enables Control Transfer Records extension support on riscv platform.

v1: RISC-V: hwprobe: Add MISALIGNED_PERF key

This causes problems when used in conjunction with RISCV_HWPROBE_WHICH_CPUS, since SLOW, FAST, and EMULATED have values whose bits overlap with each other.
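To see why overlapping values are a problem for CPU selection, here is a small Python sketch. The constant values are how I recall them from the RISC-V hwprobe uapi header and should be treated as assumptions, as should the helper names `bitmask_match` and `value_match`. The sketch only models two comparison rules; it is not the kernel code.

```python
# Assumed values of the misaligned-access performance states
# (recalled from the RISC-V hwprobe uapi header; verify before relying on them).
MISALIGNED_UNKNOWN = 0
MISALIGNED_EMULATED = 1
MISALIGNED_SLOW = 2
MISALIGNED_FAST = 3
MISALIGNED_UNSUPPORTED = 4


def bitmask_match(cpu_value: int, wanted: int) -> bool:
    """Bitmask-style rule: a CPU matches if every requested bit is set."""
    return (cpu_value & wanted) == wanted


def value_match(cpu_value: int, wanted: int) -> bool:
    """Value-style rule: a CPU matches only on exact equality."""
    return cpu_value == wanted


if __name__ == "__main__":
    # Hypothetical system: CPUs 0-1 are FAST, CPUs 2-3 are SLOW.
    cpus = {0: MISALIGNED_FAST, 1: MISALIGNED_FAST,
            2: MISALIGNED_SLOW, 3: MISALIGNED_SLOW}
    as_mask = [c for c, v in cpus.items() if bitmask_match(v, MISALIGNED_SLOW)]
    as_value = [c for c, v in cpus.items() if value_match(v, MISALIGNED_SLOW)]
    print("bitmask rule selects:", as_mask)
    print("value rule selects:  ", as_value)
```

Under the bitmask rule a query for SLOW CPUs also selects FAST ones, because the bits of FAST include the bits of SLOW. Comparing the state as a plain value avoids that.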

v4: mm: multi-gen LRU: Walk secondary MMU page tables while aging

This patchset makes it possible for MGLRU to consult secondary MMUs while doing aging, not just during eviction.

v1: Zacas/Zabha support and qspinlocks

This implements [cmp]xchgXX() macros using Zacas and Zabha extensions and finally uses those newly introduced macros to add support for qspinlocks: note that this implementation of qspinlocks satisfies the forward progress guarantee.

v1: clk: clkdev: don’t fail clkdev_alloc() if over-sized

Don’t fail clkdev_alloc() if the strings are over-sized. In this case, the entry will not match during lookup, so it’s useless.

v1: clk: sifive: Do not register clkdevs for PRCI clocks

These clkdevs were unnecessary, because systems using this driver always look up clocks using the devicetree.

v1: irqchip/riscv-aplic: Simplify the to_of_node code

to_of_node() already performs the is_of_node() check, so there is no need to call is_of_node() before to_of_node(). If is_of_node() is false, to_of_node() returns NULL and of_property_present() then returns false, so remove the redundant check.

v1: Add board support for Sipeed LicheeRV Nano

The LicheeRV Nano is a RISC-V SBC based on the Sophgo SG2002 chip. Adds minimal device tree files for this board to make it boot to a basic shell.

v1: PCI: microchip: support using either instance 1 or 2

The current driver and binding for PolarFire SoC’s PCI controller assume that the root port instance in use is instance 1.

v2: riscv: lib: relax assembly constraints in hweight

rd and rs don’t have to be the same. In some cases where rs needs to be saved for later usage, this will save us some mv instructions.

v1: RISC-V: io: Don’t have a void* PCI_IOBASE

v1: riscv: enable HAVE_ARCH_HUGE_VMAP for XIP kernel

This also fixes a boot problem for XIP kernel introduced by the commit in “Fixes:”. This commit used huge page mapping for vmemmap, but huge page vmap was not enabled for XIP kernel.

LoongArch Architecture Support

v3: LoongArch: KVM: Add Binary Translation extension support

Like the FPU extension, a late enabling method is used here for LBT. LBT context is saved/restored on the vcpu context switch path. This patch set also adds BT capability detection and a BT register get/set interface for the userspace VMM, so that the VM supports migration with the BT extension.

Process Scheduling

v1: sched,x86: export percpu arch_freq_scale

Export the underlying percpu symbol on x86 so that external trace point helper modules can be made to work again.

v2: sched/fair: Reschedule the cfs_rq when current is ineligible

I found that some tasks have been running for a long enough time and have become ineligible, but they are still not releasing the CPU. This will increase the scheduling delay of other processes. Therefore, I tried checking the current process in wakeup_preempt and entity_tick, and if it is ineligible, reschedule that cfs queue.

v1: sched: core: quota and parent_quota can be uninitialized and assigned values

quota and parent_quota are always assigned before they are used, so their use is not affected by leaving them uninitialized at declaration.

Memory Management

v1: mm: increase totalram_pages on freeing to buddy system

Total memory represents pages managed by buddy system. After the introduction of DEFERRED_STRUCT_PAGE_INIT, it may count the pages before being managed.

v1: maple_tree: add mas_node_count() before going to slow_path in mas_wr_modify()

If there are not enough nodes, mas_node_count() sets an error state via mas_set_err() and returns control flow to the beginning. In the return flow, mas_nomem() checks the error status, allocates new nodes, and resumes execution.

v4: slab: Introduce dedicated bucket allocator

v1: mm: read page_type using READ_ONCE

Let’s use READ_ONCE to avoid load tearing (shouldn’t make a difference) and to make KCSAN happy. Likely, we might also want to use WRITE_ONCE for the writer side of page_type, if KCSAN ever complains about that. But we’ll not mess with that for now.

v1: mm: sparse: Consistently use _nr

Consistently name the return variable with an _nr suffix, whenever calling pfn_to_section_nr(), to avoid confusion with a (struct mem_section *).

v1: mm: Reduce the number of slab->folio casts

Mark a few more folio functions as taking a const folio pointer, which allows us to remove a few places in slab which cast away the const.

v2: DAMON multiple contexts support

This patch-set implements support for multiple contexts per kdamond.

v11: LUF(Lazy Unmap Flush) reducing tlb numbers over 90%

While I’m working with a tiered memory system e.g. CXL memory, I have been facing migration overhead esp. tlb shootdown on promotion or demotion between different tiers.

v1: fs: sys_ringbuffer() (WIP)

Add new syscalls for generic ringbuffers that can be attached to arbitrary (supporting) file descriptors.

v1: mm/memory-failure: Stop setting the folio error flag

Nobody checks the error flag any more, so setting it accomplishes nothing. Remove the obsolete parts of this comment; it hasn’t been true since errseq_t was used to track writeback errors in 2017.

v3: vmstat: Kernel stack usage histogram

Provide a kernel stack usage histogram to aid in optimizing kernel stack sizes and minimizing memory waste in large-scale environments. The histogram divides stack usage into power-of-two buckets and reports the results in /proc/vmstat. This information is especially valuable in environments with millions of machines, where even small optimizations can have a significant impact.
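A rough Python model of power-of-two bucketing follows. The bucket bounds (1 KiB to 64 KiB), the function names, and the sample data are illustrative assumptions, not the counters the patch exposes in /proc/vmstat.

```python
from collections import Counter

MIN_BUCKET = 1024        # assumed smallest bucket, in bytes
MAX_BUCKET = 64 * 1024   # assumed largest bucket, in bytes


def stack_bucket(used_bytes: int) -> int:
    """Return the smallest power-of-two bucket that holds used_bytes.

    Usage above MAX_BUCKET is clamped into the last bucket.
    """
    bucket = MIN_BUCKET
    while bucket < used_bytes and bucket < MAX_BUCKET:
        bucket *= 2
    return bucket


def histogram(samples):
    """Count how many stack-usage samples fall into each bucket."""
    return Counter(stack_bucket(s) for s in samples)


if __name__ == "__main__":
    usage = [300, 900, 1500, 3000, 3100, 9000]  # made-up sample data
    for bucket, count in sorted(histogram(usage).items()):
        print(f"<= {bucket // 1024}k: {count}")
```

With only a handful of buckets, the per-bucket counters stay cheap to maintain while still showing whether most stacks would fit in a smaller allocation.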

v1: mm: store zero pages to be swapped out in a bitmap

As shown in the patchseries that introduced the zswap same-filled optimization [1], 10-20% of the pages stored in zswap are same-filled. This is also observed across Meta’s server fleet. By using VM counters in swap_writepage (not included in this patchseries) it was found that less than 1% of the same-filled pages to be swapped out are non-zero pages.
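The idea in the title can be modeled in a few lines of Python. This is only a sketch under assumptions: the `ZeroMap` class, its method names, and the one-bit-per-swap-slot layout are hypothetical and do not mirror the patch's data structures.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


class ZeroMap:
    """One bit per swap slot; a set bit means the page was all zeroes."""

    def __init__(self, nr_slots: int) -> None:
        self.bits = bytearray((nr_slots + 7) // 8)
        self.backing = {}  # stands in for the real swap device

    def swap_out(self, slot: int, page: bytes) -> bool:
        """Store a page; return True if the write to the device was skipped."""
        if page.count(0) == len(page):
            self.bits[slot // 8] |= 1 << (slot % 8)
            self.backing.pop(slot, None)  # drop any stale copy for this slot
            return True
        self.bits[slot // 8] &= ~(1 << (slot % 8)) & 0xFF
        self.backing[slot] = bytes(page)
        return False

    def swap_in(self, slot: int) -> bytes:
        """Return the page contents, synthesizing zeroes when the bit is set."""
        if self.bits[slot // 8] & (1 << (slot % 8)):
            return bytes(PAGE_SIZE)
        return self.backing[slot]
```

A zero page then costs one bit of memory and no device I/O in either direction, which is the saving the cover letter is after.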

v3: add mTHP support for anonymous shmem

Anonymous pages already support multi-size THP (mTHP) allocation through commit 19eaf44954df, which allows THP to be configured through the sysfs interface located at ‘/sys/kernel/mm/transparent_hugepage/hugepage-XXkb/enabled’.

v1: mm: vmscan: reset sc->priority on retry

The commit 6be5e186fd65 (“mm: vmscan: restore incremental cgroup iteration”) added a retry reclaim heuristic to iterate all the cgroups before returning an unsuccessful reclaim, but failed to reset sc->priority. Let’s fix it.

v6: enable bs > ps in XFS

This is the sixth version of the series that enables block size > page size (Large Block Size) in XFS, targeted for inclusion in 6.11.

v2: mm: page_type, zsmalloc and page_mapcount_reset()

Wanting to remove the remaining abuser of _mapcount/page_type along with page_mapcount_reset(), I stumbled over zsmalloc, which is yet to be converted away from “struct page”.

v5: large folios swap-in: handle refault cases first

This patch is extracted from the large folio swapin series[1], primarily addressing the handling of scenarios involving large folios in the swap cache.

v1: mm/hugetlb: Do not call vma_add_reservation upon ENOMEM

syzbot reported a splat [1] on __unmap_hugepage_range(). Check for that and do not call vma_add_reservation() if that is the case, otherwise region_abort() and region_del() will see that we do not have any file_regions.

v4: percpu_counter: add a cmpxchg-based _add_batch variant

Interrupt disable/enable trips are quite expensive on x86-64 compared to a mere cmpxchg (note: no lock prefix!) and percpu counters are used quite often.

v2: memcg: rearrange fields of mem_cgroup_per_node

Kernel test robot reported [1] performance regression for will-it-scale test suite’s page_fault2 test case for the commit 70a64b7919cb (“memcg: dynamically allocate lruvec_stats”). After inspection it seems like the commit has unintentionally introduced false cache sharing.

v7: Memory management patches needed by Rust Binder

This patchset contains some abstractions needed by the Rust implementation of the Binder driver for passing data between userspace, kernelspace, and directly into other processes.

v3: mm: migrate: support poison recover from migrate folio

This series of patches provides a mechanism to recover from poison encountered during folio copy in the widely used folio migration path.

File Systems

v1: readdir: Add missing quote in macro comment

Add a missing double quote in the unsafe_copy_dirent_name() macro comment.

v1: ext4: simplify the counting and management of delalloc reserved blocks

This patch series is part 3 of the preparatory changes for the buffered IO iomap conversion; it simplifies the counting and updating logic of delalloc reserved blocks. This series has passed kvm-xfstests in auto mode many times; please take a look at it.

v1: fs: don’t block i_writecount during exec

Back in 2021 we already discussed removing deny_write_access() for executables. Back then I was hesitant because I thought that this might cause issues in userspace. It’s not completely out of the realm of possibility, but let’s find out if that’s actually the case and not guess.

v1: struct fd situation

I've done another round of review of users.

v1: kernel/sysctl-test: add MODULE_DESCRIPTION()

Fix the ‘make W=1’ warning: WARNING: modpost: missing MODULE_DESCRIPTION() in kernel/sysctl-test.o

v1: Start moving write_begin/write_end out of aops

Christoph wants to remove write_begin/write_end from aops and pass them to filemap as callback functions. Here’s one possible route to do this. I combined it with the folio conversion (because why touch the same code twice?) and tweaked some of the other things (support for ridiculously large folios with size_t lengths, remove the need to initialise fsdata by passing only a pointer to the fsdata pointer).

v1: fs/netfs/fscache_cookie: add missing “n_accesses” check

This fixes a NULL pointer dereference bug due to a data race which looks like this:

v1: KTEST: add test to exercise the new mount API for bcachefs

v1: v5.1: fs: Allow fine-grained control of folio sizes

We need filesystems to be able to communicate acceptable folio sizes to the pagecache for a variety of uses (e.g. large block sizes). Support a range of folio sizes between order-0 and order-31.

v1: netfs: Fault in smaller chunks for non-large folio mappings

As in commit 4e527d5841e2 (“iomap: fault in smaller chunks for non-large folio mappings”), we can see a performance loss for filesystems which have not yet been converted to large folios.

v1: fs: autofs: add MODULE_DESCRIPTION()

Fix the ‘make W=1’ warning: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/autofs/autofs4.o

v1: enhance the path resolution capability in fs_parser

The following is a brief overview of the patches, see the patches for more details.

v1: isofs: add missing MODULE_DESCRIPTION()

Fix the ‘make W=1’ warning: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/isofs/isofs.o

Networking Devices

v5: ext4: check hash version and filesystem casefolded consistent

When mounting an ext4 filesystem, abort the mount if the hash version and the casefolded feature are not consistent.

v2: PCIe TPH and cache direct injection support

This series introduces generic TPH support in Linux, allowing STs to be retrieved from ACPI _DSM (as defined by ACPI) and used by PCIe endpoint drivers as needed.

v2: net-next: vmxnet3: upgrade to version 9

This patch series extends the vmxnet3 driver to leverage these new features.

v3: net-next: net: mana: Allow variable size indirection table

Allow variable size indirection table allocation in MANA instead of using a constant value MANA_INDIRECT_TABLE_SIZE. The size is now derived from the MANA_QUERY_VPORT_CONFIG and the indirection table is allocated dynamically.

v4: Add Microchip KSZ 9897 Switch CPU PHY + Errata

Back in 2022, I had posted a series of patches to support the KSZ9897 switch’s CPU PHY ports but some discussions had not been concluded with Microchip. I’ve been maintaining the patches since and I’m now resubmitting them with some improvements to handle new KSZ9897 errata sheets (also concerning the whole KSZ9477 family).

v2: net-next: net: smc91x: Refactor SMC_* macros

Use the macro parameter lp directly instead of relying on ioaddr being defined in the surrounding scope.

v2: vmxnet3: disable rx data ring on dma allocation failure

To fix this bug, rq->data_ring.desc_size needs to be set to 0 to tell the hypervisor to disable this feature.

v1: net-next: Introduce EN7581 ethernet support

Add airoha_eth driver in order to introduce ethernet support for Airoha EN7581 SoC available on EN7581 development board.

v3: iwl-net: ice: implement AQ download pkg retry

ice_aqc_opc_download_pkg (0x0C40) AQ sporadically returns error due to FW issue. Fix this by retrying five times before moving to Safe Mode.
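The retry policy can be sketched as below. `send_download_pkg` is a hypothetical callable standing in for the admin queue command; reading "five times" as five attempts in total, and a zero return as success, are assumptions.

```python
MAX_ATTEMPTS = 5  # assumed: five attempts in total


def download_with_retry(send_download_pkg, max_attempts: int = MAX_ATTEMPTS):
    """Call send_download_pkg() until it returns 0 or attempts run out.

    Returns the number of attempts used on success, or None when every
    attempt failed and the caller should fall back to Safe Mode.
    """
    for attempt in range(1, max_attempts + 1):
        if send_download_pkg() == 0:
            return attempt
    return None


if __name__ == "__main__":
    # Simulated firmware that fails twice, then succeeds.
    replies = iter([-1, -1, 0])
    print("succeeded on attempt", download_with_retry(lambda: next(replies)))
```

Bounding the loop matters here: a sporadic firmware error is absorbed, while a persistent one still ends in the Safe Mode fallback instead of retrying forever.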

v4: net: tcp/mptcp: count CLOSE-WAIT for CurrEstab

Taking CLOSE-WAIT sockets into CurrEstab counters is in accordance with RFC

v2: ext4: add casefolded feature check before setup encrypted info

Due to the current file system not supporting the casefolded feature, only i_crypt_info was initialized when creating encrypted information, without actually setting the sighash. Therefore, when creating an inode, if the system does not support the casefolded feature, encrypted information will not be created.

v1: net-next: tcp: refactor skb_cmp_decrypted() checks

Refactor the input path coalescing checks and wrap “EOR forcing” logic into a helper. This will hopefully make the code easier to follow. While at it throw some DEBUG_NET checks into skb_shift().

v2: net-next: net: visibility of memory limits in netns

Some programs need to know the size of the network buffers to operate correctly, so export the following sysctls read-only in network namespaces.

v1: net-next: ionic: advertise 52-bit addressing limitation for MSI-X

Current ionic devices only support 52 internal physical address lines. This is sufficient for x86_64 systems which have similar limitations but does not apply to all other architectures, notably IBM POWER (ppc64).

v2: net-next: bnxt_en: add timestamping statistics support

The ethtool_ts_stats structure was introduced earlier this year. Now it’s time to support this group of counters in more drivers. This patch adds support to bnxt driver.

v10: net-next: Device Memory TCP

v4: net-next: net: allow dissecting/matching tunnel control flags

Ilya says: “for correct matching on decapsulated packets, we should match on not only tunnel id and headers, but also on tunnel configuration flags like TUNNEL_NO_CSUM and TUNNEL_DONT_FRAGMENT.”

v1: net-next: af_unix: Don’t check last_len in unix_stream_data_wait().

When commit 869e7c62486e (“net: af_unix: implement stream sendpage support”) added sendpage() support, data could be appended to the last skb in the receiver’s queue.

v2: net-next: tcp: add sysctl_tcp_rto_min_us

Add a sysctl knob to allow users to specify a default rto_min at socket init time.

GIT PULL: Networking for v6.10-rc2

[net-next PATCH] octeontx2: Improve mailbox tracepoints for debugging

The tracepoints present currently wrt mailbox do not provide enough information to debug mailbox activity.

Security Hardening

v4: Hardening perf subsystem

This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows.
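Why open-coded multiplications in allocation sizes are risky can be shown with a small Python model of 64-bit size_t arithmetic. The function names are illustrative. The saturating variant mimics the behavior of kernel helpers such as struct_size(), which saturate at SIZE_MAX so that the allocation fails instead of succeeding with a too-small buffer.

```python
SIZE_MAX = 2**64 - 1  # size_t on a 64-bit system


def wrapping_size(header: int, count: int, elem: int) -> int:
    """Open-coded header + count * elem, truncated like C size_t arithmetic."""
    return (header + count * elem) & SIZE_MAX


def saturating_size(header: int, count: int, elem: int) -> int:
    """Same computation, but clamp to SIZE_MAX instead of wrapping."""
    return min(header + count * elem, SIZE_MAX)


if __name__ == "__main__":
    # An attacker-influenced element count large enough to overflow.
    header, count, elem = 16, 2**61, 8
    print("wrapping:  ", wrapping_size(header, count, elem))
    print("saturating:", saturating_size(header, count, elem))
```

In the wrapping case the computed size collapses to just the header, so an allocator would happily return a tiny buffer for a huge element count.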

v1: ubsan: add missing MODULE_DESCRIPTION() macro

Add the missing invocation of the MODULE_DESCRIPTION() macro.

v4: Introduce STM32 DMA3 support

In STM32MP25 SoC [1], 3 HPDMAs and 1 LPDMA are embedded. Only HPDMAs are used by Linux.

v1: x86/boot: add prototype for __fortify_panic()

As discussed in [1] add a prototype for __fortify_panic() to fix the ‘make W=1 C=1’ warning:

v1: x86/hpet: Read HPET directly if panic in progress

To avoid these dead loops, read the HPET directly if a panic is in progress.

v2: dma-buf/fence-array: Add flex array to struct dma_fence_array

This is an effort to get rid of all multiplications from allocation functions in order to prevent integer overflows.

Asynchronous IO

v3: liburing: test: add test cases for hugepage registered buffers

Add a test file for hugepage registered buffers, to make sure the fixed buffer coalescing feature works safely and soundly.

v1: io_uring/net: assign kmsg inq/flags before buffer selection

syzbot reports that recv is using an uninitialized value:

Rust For Linux

v3: Rust block device driver API and null block driver

Rebased on v6.10-rc1 and implemented a ton of improvements suggested by Benno. v2 is here [2]

v2: net::phy support for C45

Adds support for reading/writing C45 registers and genphy helper functions executed via C45 registers.

v1: Makefile: rust-analyzer target: better error handling and comments

This is confusing at first, because there is, in fact, a rust-analyzer build target. It’s just not set up to handle errors gracefully.

v1: kbuild: rust: provide an option to inline C helpers into Rust

This RFC presents an option `RUST_LTO_HELPERS` to inline C helpers into Rust. This is similar to LTO, but the extra inlining and optimisation are performed per Rust crate (compilation unit) instead of at final link time, which gives better compilation speed. It also means that the presented approach works for loadable modules as well.

v1: rust: net::phy support to C45 registers access

Adds support for C45 register access. C45 registers can be accessed in two ways: either the C45 bus protocol or C45 over C22. Normally, a PHY driver shouldn’t care how the access is done; PHYLIB chooses the appropriate method. But there is an exception: PHY hardware supporting only the C45 bus protocol.

BPF

v1: bpf-next: libbpf: implement BTF field iterator

Switch from callback-based iteration over BTF type ID and string offset fields to an iterator-based approach. Switch all existing internal use cases to this new iterator.

v1: bpf: Make session kfuncs global

The bpf_session_cookie is unavailable for !CONFIG_FPROBE, as reported by Sebastian. Instead of adding more ifdefs, make the session kfuncs globally available, as suggested by Alexei. They are still allowed only for session programs, but this won’t fail the build.

v2: net-next: virtnet_net: prepare for af-xdp

This patch set prepares for supporting af-xdp zerocopy. There is no feature change in this patch set. I just want to reduce the patch num of the final patch set, so I split the patch set.

v1: bpf-next: use network helpers, part 6

To move dctcp-specific test code out of do_test() into test_dctcp(), this patchset adds a new helper start_test() in bpf_tcp_ca.c to refactor do_test(). It also addresses Martin’s comments on the previous series.

v7: bpf-next: Notify user space when a struct_ops object is detached/unregistered

This patch set enables the detach feature for struct_ops links and sends an event to epoll when a link is detached. Subsystems could call link->ops->detach() to detach a link and notify user space programs through epoll.

v1: net: tap: validate metadata and length for XDP buff before building up skb

The cited commit failed to check the validity of the length and various pointers in the XDP buff metadata in the tap_get_user_xdp() path, which could cause a corrupted skb to be sent downstack. For instance, tap_get_user() prohibits transmitting short frames whose length is less than the Ethernet header size, while skb_set_network_header() in tap_get_user_xdp() would set the skb’s network_header regardless of the actual XDP buff data size. This could either cause out-of-bound access beyond the actual length, or confuse the underlying layer with an incorrect or inconsistent header length in the skb metadata.

v1: bpf: libbpf: don’t close(-1) in multi-uprobe feature detector

Guard close(link_fd) with extra link_fd >= 0 check to prevent close(-1). Detected by Coverity static analysis.

v1: bpf-next: libbpf: keep FD_CLOEXEC flag when dup()’ing FD

Make sure to preserve and/or enforce FD_CLOEXEC flag on duped FDs. Use dup3() with O_CLOEXEC flag for that.

v2: net-next: net: validate SO_TXTIME clockid coming from userspace

Add validation in setsockopt to support only CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_TAI to be set from userspace.
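A minimal sketch of such a check, written in Python for brevity. The clock ID numbers are the Linux values; the function name and the choice of -EINVAL as the error code are assumptions for illustration.

```python
import errno

# Linux clock IDs
CLOCK_REALTIME = 0
CLOCK_MONOTONIC = 1
CLOCK_TAI = 11

ALLOWED_TXTIME_CLOCKS = {CLOCK_REALTIME, CLOCK_MONOTONIC, CLOCK_TAI}


def validate_txtime_clockid(clockid: int) -> int:
    """Return 0 for an accepted clock, or a negative errno otherwise."""
    if clockid in ALLOWED_TXTIME_CLOCKS:
        return 0
    return -errno.EINVAL  # assumed error code


if __name__ == "__main__":
    for cid in (CLOCK_REALTIME, CLOCK_TAI, 7, -1):
        print(cid, "->", validate_txtime_clockid(cid))
```

Rejecting unknown IDs at setsockopt time means later code that consumes the stored clockid never has to handle an arbitrary userspace-supplied value.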

v1: bpftool: Query only cgroup-related attach types

v4: bpf-next: netfilter: Add the capability to offload flowtable in XDP layer

This series has been tested running the xdp_flowtable_offload eBPF program on an ixgbe 10Gbps NIC (eno2) in order to XDP_REDIRECT the TCP traffic to a veth pair (veth0-veth1) based on the content of the nf_flowtable as soon as the TCP connection is in the established state.

v2: bpf: Allocate bpf_event_entry with node info

It was reported that accessing perf_event map entry caused pretty high LLC misses in get_map_perf_counter(). As reading perf_event is allowed for the local CPU only, I think we can use the target CPU of the event as a hint for the allocation like in perf_event_alloc() so that the event and the entry can be in the same node at least.

v1: net: validate SO_TXTIME clockid coming from userspace

Add validation in setsockopt to support only CLOCK_REALTIME, CLOCK_MONOTONIC and CLOCK_TAI to be set from userspace.

v5: bpf-next: bpf: support resilient split BTF

The series first focuses on generating split BTF with distilled base BTF; then relocation support is added to allow split BTF with an associated distilled base to be relocated with a new base BTF.

Related Technology News

Qemu

v2: Improve the performance of RISC-V vector unit-stride/whole register ld/st instructions

In this new version, we added patches that try to load/store more data at a time in part of vector continuous load/store (unit-stride/whole register) instructions with some assumptions (e.g. no masking, no tail agnostic, perform virtual address resolution once for the entire vector, etc.) as suggested by Richard Henderson.

v1: hw/riscv/virt.c: add address-cells in create_fdt_one_aplic()

We need #address-cells properties in all interrupt controllers that are referred to by an interrupt-map [1]. For the RISC-V machine, both PLIC and APLIC controllers must have this property.

v1: target/riscv: Add support for Control Transfer Records Ext.

This series enables Control Transfer Records extension support on riscv platform. This extension is similar to Arch LBR in x86 and BRBE in ARM.

v2: target/riscv: zvbb implies zvkb

According to the RISC-V crypto spec, the Zvkb extension is a proper subset of the Zvbb extension.

v2: RESEND: target/riscv/kvm: QEMU support for KVM Guest Debug on RISC-V

This series implements QEMU KVM Guest Debug on RISC-V, with which we could debug RISC-V KVM guest from the host side, using software breakpoints.

v2: target/riscv/kvm: QEMU support for KVM Guest Debug on RISC-V

This series implements QEMU KVM Guest Debug on RISC-V, with which we could debug RISC-V KVM guest from the host side, using software breakpoints.

v1: targer/riscv: Implement Zabha extension

Add Zabha implementation.

v1: riscv-to-apply queue

v7: target/riscv/kvm/kvm-cpu.c: kvm_riscv_handle_sbi() fail with vendor-specific SBI

Add a new error path to report a proper error when qemu_chr_fe_read_all() does not return sizeof(ch). Exactly zero just means we failed to read input, which can happen; in that case, telling the SBI caller that the read failed while telling the caller of this function that the SBI call was emulated successfully is correct. Anything else other than sizeof(ch) means something unexpected happened, so we should return an error. Also add defines for SBI-related return codes.

U-Boot

v1: doc: Add UEFI supplement document

Add UEFI supplement document to define some behaviours on architectures not covered by the original specification.


