泰晓科技 (TinyLab) -- Focused on Linux: tracing things back to their source, seeing the whole from small details!
Website: https://tinylab.org

RISC-V Linux Kernel and Related Technology News, Issue 44

Written by 呀呀呀 on 2023/05/02

Date: 20230501
Editor: 晓依
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support

v1: RISC-V: Export Zba, Zbb to usermode via hwprobe

This change detects the presence of Zba and Zbb extensions and exports them per-hart to userspace via the hwprobe mechanism. Glibc can then use these in setting up hwcaps-based library search paths.

GIT PULL: RISC-V Patches for the 6.4 Merge Window, Part 1

RISC-V Patches for the 6.4 Merge Window, Part 1

  • Support for runtime detection of the Svnapot extension.
  • Support for Zicboz when clearing pages.
  • We’ve moved to GENERIC_ENTRY.
  • Support for !MMU on rv32 systems.
  • The linear region is now mapped via huge pages.
  • Support for building relocatable kernels.
  • Support for the hwprobe interface.
  • Various fixes and cleanups throughout the tree.

v2: riscv: allow case-insensitive ISA string parsing

The original motivation for my patch v1[5] is that some SoC generators will provide generated DT with illegal ISA string in dt-binding such as rocket-chip, which will even cause kernel panic in some cases as I mentioned in v1[5]. Now, the rocket-chip has been fixed in PR #3333[6]. However, when using some specific version of rocket-chip with illegal ISA string in DT, this patchset will also work for parsing uppercase letters correctly in DT, thus will have better compatibility.

v1: Limit the number of counter returned from SBI.

Perf relies on the SBI being reliable and trusts whatever it returns, even when something goes wrong. During a debugging session I passed more than RISCV_MAX_COUNTERS counters from SBI to perf. At first glance this bloated the allocated variable pmu_ctr_list and caused wrap-around writes to the counter mask. There may have been other effects as well. In any case, it is better to add an extra check.
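The extra check described above can be sketched as a small user-space Python model (not kernel code). The function names are hypothetical, the value 64 is an assumed stand-in for RISCV_MAX_COUNTERS, and whether the actual patch clamps the count or rejects it is not stated here; this sketch clamps.

```python
# Assumed value for illustration only; the kernel defines RISCV_MAX_COUNTERS.
RISCV_MAX_COUNTERS = 64


def clamp_counter_count(reported: int) -> int:
    """Return a counter count that is safe to use for allocation and masks.

    `reported` models the number of counters returned by the SBI call.
    Non-positive values are treated as "no counters"; oversized values
    are capped instead of being trusted.
    """
    if reported <= 0:
        return 0
    return min(reported, RISCV_MAX_COUNTERS)


def counter_mask(reported: int) -> int:
    """Build a bitmask with one bit per usable counter (hypothetical helper)."""
    return (1 << clamp_counter_count(reported)) - 1
```

With the clamp in place, an SBI implementation that misreports the count cannot make the mask or the counter list grow past the compile-time limit.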

v1: -next: clk: sifive: Use devm_platform_ioremap_resource()

Convert the platform_get_resource() and devm_ioremap_resource() pair to a single call to devm_platform_ioremap_resource(), as this is exactly what this function does.

v2: RISC-V: Align SBI probe implementation with spec

sbi_probe_extension() is specified with “Returns 0 if the given SBI extension ID (EID) is not available, or 1 if it is available unless defined as any other non-zero value by the implementation.” Additionally, sbiret.value is a long. Fix the implementation to ensure any nonzero long value is considered a success, rather than only positive int values.
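The difference between the two checks can be illustrated with a small Python model. This is a sketch under assumptions: the function names are hypothetical, the old check is modeled as "truncate to a 32-bit int and require a positive value", and `long` is modeled as 64 bits wide (as on rv64).

```python
import ctypes


def probe_ok_old(value: int) -> bool:
    """Model of the old behaviour: only positive int values count as success."""
    return ctypes.c_int32(value).value > 0


def probe_ok_new(value: int) -> bool:
    """Model of the spec-aligned behaviour: any nonzero long is a success."""
    return ctypes.c_int64(value).value != 0
```

Values such as `1 << 32` (zero after truncation to int) or `-1` (negative) are where the two models disagree: both are nonzero longs, so the spec treats the extension as available.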

v1: dt-bindings: riscv: explicitly mention assumption of Zicsr & Zifencei support

The dt-binding was defined before the extraction of csr access and fence.i into their own extensions, and thus the presence of the I base extension implies Zicsr and Zifencei. There’s no harm in adding them obviously, but for backwards compatibility with DTs that existed prior to that extraction, software is unable to differentiate between “i” and “i_zicsr_zifencei” without any further information.

v1: RISC-V: KVM: Ensure SBI extension is enabled

Ensure guests can’t attempt to invoke SBI extension functions when the SBI extension’s probe function has stated that the extension is not available.

v1: Handle multi-letter extensions starting with caps in riscv,isa

Following on from [1] in which Yangyu reported kernel panics for a riscv,isa string containing “rv64ima_Zifencei”, as the parser got confused by the capital letter, here’s a small change to the parser to handle invalid extensions starting with capital & the removal of some inaccurate wording from the dt-binding.

v1: dmaengine: xilinx: enable on RISC-V platform

Enable the xilinx dmaengine driver on the RISC-V platform. We have verified the CDMA on a RISC-V platform; enabling this configuration allows the driver to be built for RISC-V.

v1: Allow case-insensitive RISC-V ISA string

According to the RISC-V ISA specification, the ISA naming strings are case insensitive. The kernel docs currently require the riscv,isa string to be all lowercase to simplify parsing. However, this limitation is not consistent with the RISC-V ISA spec.
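A minimal sketch of case-insensitive parsing, written as a Python model rather than the kernel's parser. The function name `parse_isa` is hypothetical, and the sketch makes simplifying assumptions: no version numbers, and every multi-letter extension is introduced by an underscore.

```python
from typing import List, Set, Tuple


def parse_isa(isa: str) -> Tuple[int, Set[str], List[str]]:
    """Split an ISA string into (xlen, single-letter set, multi-letter list)."""
    s = isa.lower()  # ISA naming strings are case insensitive
    if not s.startswith(("rv32", "rv64")):
        raise ValueError(f"unsupported ISA string: {isa!r}")
    xlen = int(s[2:4])
    # Everything before the first underscore is a run of single letters.
    base, *multi = s[4:].split("_")
    if not base.isalpha():
        raise ValueError(f"malformed single-letter extensions in {isa!r}")
    return xlen, set(base), [m for m in multi if m]
```

Lowercasing once up front means the rest of the parser never has to distinguish "Zifencei" from "zifencei", which is the situation that confused a parser expecting lowercase only.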

v2: riscv: mm: Ensure prot of VM_WRITE and VM_EXEC must be readable

Commit 8aeb7b17f04e (“RISC-V: Make mmap() with PROT_WRITE imply PROT_READ”) allows riscv to use mmap with PROT_WRITE only, and mmap with w+x is also permitted. However, when userspace accesses a page mapped with PROT_WRITE|PROT_EXEC, it loops forever on load page faults, which also triggers a soft lockup. According to the riscv privileged spec, “Writable pages must also be marked readable”. The fix drops PAGE_COPY_READ_EXEC so that PAGE_COPY_EXEC is used instead. This aligns protection_map with other arches (e.g. arm64).
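The rule “Writable pages must also be marked readable” can be modeled in a few lines of Python. This is only an illustration of the rule, not the kernel's protection_map table; the function name `effective_prot` is hypothetical, and the flag values are the Linux `<sys/mman.h>` constants.

```python
# PROT_* values as defined by Linux <sys/mman.h>.
PROT_READ, PROT_WRITE, PROT_EXEC = 0x1, 0x2, 0x4


def effective_prot(prot: int) -> int:
    """Return the permissions a mapping ends up with when W implies R."""
    if prot & PROT_WRITE:
        prot |= PROT_READ  # a writable page must also be readable
    return prot
```

Under this rule a request for write+exec ends up as read+write+exec, so a load from such a page does not fault.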

v1: Expose the isa-string via the AT_BASE_PLATFORM aux vector

The hwprobing infrastructure was merged recently [0] and contains a mechanism to probe not only extensions but also microarchitectural features at a per-core level of detail.

v1: RESEND: dt-bindings: riscv: add sv57 mmu-type

Dumping the dtb from new versions of QEMU warns that sv57 is an undocumented mmu-type. The kernel has supported sv57 for about a year, so bring it into the fold.

v5: Add STG/ISP/VOUT clock and reset drivers for StarFive JH7110

This patch series is based on the basic JH7110 SYSCRG/AONCRG drivers and adds new partial clock drivers and reset support for System-Top-Group (STG), Image-Signal-Process (ISP) and Video-Output (VOUT) on the StarFive JH7110 RISC-V SoC. These clocks and resets can be used by the DMA, VIN and Display modules.

v10: riscv: Allow to downgrade paging mode from the command line

This new version gets rid of the limitation that prevented KASAN kernels from using the newly introduced parameters.

While looking into KASLR, I came across commit aacd149b6238 (“arm64: head: avoid relocating the kernel twice for KASLR”): it allows using the fdt functions very early in the boot process with KASAN enabled by simply compiling a new version of those functions without instrumentation.

v1: riscv: replace deprecated scall with ecall

scall is a deprecated alias for ecall. ecall is used in several places, so there is no assembler compatibility concern.

Process Scheduling

v2: sched/topology: add for_each_numa_cpu() macro

for_each_cpu() is widely used in kernel, and it’s beneficial to create a NUMA-aware version of the macro.

Recently added for_each_numa_hop_mask() works, but switching existing codebase to it is not an easy process.
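The idea of a NUMA-aware iterator can be sketched in Python as a generator that visits CPUs nearest-first. This is a model under assumptions, not the kernel macro: the parameter names, the `cpu_to_node` mapping and the distance matrix are all illustrative.

```python
from typing import Dict, Iterator, List


def for_each_numa_cpu(start_node: int,
                      cpu_to_node: Dict[int, int],
                      distance: List[List[int]]) -> Iterator[int]:
    """Yield CPU ids ordered by NUMA distance from start_node, nearest first."""
    # Sort nodes by their distance from the starting node (ties by node id).
    nodes = sorted(range(len(distance)),
                   key=lambda n: (distance[start_node][n], n))
    for node in nodes:
        # Within one node, visit CPUs in ascending id order.
        for cpu in sorted(c for c, n in cpu_to_node.items() if n == node):
            yield cpu
```

A caller that currently loops over every CPU in id order could switch to such an iterator without restructuring its loop body, which is the convenience a for_each_cpu()-style macro offers.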

v1: sched: core: Simplify sched_can_stop_tick()

Remove useless intermediate variable “fifo_nr_running”.

v1: sched: add ttwu_migration counter

This patch adds the ttwu_migration counter to record the migrations. It is placed at the end so that existing tools do not break.

Memory Management

v2: permit write-sealed memfd read-only shared mappings

The man page for fcntl() describing memfd file seals states the following about F_SEAL_WRITE:-

Furthermore, trying to create new shared, writable memory-mappings via
mmap(2) will also fail with EPERM.

v1: mm/mmap/vma_merge: always check invariants

We may still have inconsistent input parameters even if we choose not to merge and the vma_merge() invariant checks are useful for checking this with no production runtime cost (these are only relevant when CONFIG_DEBUG_VM is specified).

v1: debugobjects,locking: Annotate __debug_object_init() wait type violation

On Tue, Apr 25, 2023 at 11:51:05PM +0800, Qi Zheng wrote:

I just tested the following code and it can resolve the warning I encountered. :)

v3: Reduce lock contention related with large folio

Ryan tried to enable the large folio for anonymous mapping [1].

Unlike large folio for page cache, which doesn’t trigger frequent page allocation/free, large folio for anonymous mapping is allocated/freed more frequently. So large folio for anonymous mapping exposes some lock contention.

v3: migrate_pages: Avoid blocking for IO in MIGRATE_SYNC_LIGHT

The MIGRATE_SYNC_LIGHT mode is intended to block for things that will finish quickly but not for things that will take a long time. Exactly how long is too long is not well defined, but waits of tens of milliseconds are likely non-ideal.

v3: net-next/mm: page_pool: new approach for leak detection and shutdown phase

The page_pool (PP) workqueue calling page_pool_release_retry generates too many false-positive reports. Furthermore, these reports of page_pool shutdown still having inflight packets are not very helpful for tracking down the root cause.

**[v8: mm: shmem: support POSIX_FADV_[WILL|DONT]NEED for shmem files](http://lore.kernel.org/linux-mm/cover.1682598808.git.quic_charante@quicinc.com/)**

This patch aims to implement the POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED advice for shmem files, which can be helpful for drivers that want to manage the pages of shmem files on their own, for example files created through shmem_file_setup_with_mnt.

v2: memcg: OOM log improvements

This short patch series brings back some cgroup v1 stats in OOM logs that were unnecessarily changed before. It also makes memcg OOM logs less reliant on printk() internals.

v1: mm: Do not reclaim private data from pinned page

If the page is pinned, there’s no point in trying to reclaim it. Furthermore if the page is from the page cache we don’t want to reclaim fs-private data from the page because the pinning process may be writing to the page at any time and reclaiming fs private info on a dirty page can upset the filesystem (see link below).

v1: mm: optimization on page allocation when CMA enabled

Please note the following typical scenario introduced by commit 168676649: 12MB of free CMA pages ‘help’ GFP_MOVABLE keep draining/fragmenting U&R page blocks until they shrink to 12MB without entering the slowpath, which goes against the current reclaiming policy. This commit changes the criterion from the hard-coded ‘1/2’ to a watermark check, which keeps U&R free pages around WMARK_LOW when falling back.

v5: mm/gup: disallow GUP writing to file-backed mappings by default

Writing to file-backed mappings which require folio dirty tracking using GUP is a fundamentally broken operation, as kernel write access to GUP mappings does not adhere to the semantics expected by a file system.

v3: Preserved-over-Kexec RAM

Sending out this RFC in part to gauge community interest. This patchset implements preserved-over-kexec memory storage or PKRAM as a method for saving memory pages of the currently executing kernel so that they may be restored after kexec into a new kernel. The patches are adapted from an RFC patchset sent out in 2013 by Vladimir Davydov [1]. They introduce the PKRAM kernel API.

v2: Add support for sharing page tables across processes (Previously mshare)

This patch series adds a new flag to mmap() call - MAP_SHARED_PT. This flag can be specified along with MAP_SHARED by a process to hint to kernel that it wishes to share page table entries for this file mapping mmap region with other processes. Any other process that mmaps the same file with MAP_SHARED_PT flag can then share the same page table entries. Besides specifying MAP_SHARED_PT flag, the processes must map the files at a PMD aligned address with a size that is a multiple of PMD size and at the same virtual addresses. This last requirement of same virtual addresses can possibly be relaxed if that is the consensus.
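The alignment requirements stated above can be expressed as a small check. This Python sketch is illustrative only: the function name is hypothetical, and PMD_SIZE is assumed to be 2 MiB (as on x86-64 with 4 KiB pages); other architectures and page sizes differ.

```python
# Assumption: 2 MiB PMD size, as on x86-64 with 4 KiB base pages.
PMD_SIZE = 2 * 1024 * 1024


def can_share_page_tables(addr: int, length: int,
                          pmd_size: int = PMD_SIZE) -> bool:
    """Check that a mapping starts on a PMD boundary and spans whole PMDs."""
    return length > 0 and addr % pmd_size == 0 and length % pmd_size == 0
```

The requirement that all sharing processes use the same virtual address is not modeled here; it would be a comparison across processes rather than a property of a single mapping.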

v4: shmem: Add user and group quota support for tmpfs

Hello folks.

This is the final version of the quota support from tmpfs, with all the issues addressed, and now including RwB tags on all patches, and should be ready for merge. Details are within each patch, and the original cover-letter below.

v1: mm/oom_kill: system enters a state something like hang when running stress-ng

When we run stress-ng on UC (Ubuntu Core), the system enters a state similar to a hang. We found that any test case that can trigger an OOM under UC (such as stress-ng-bigheap, stress-ng-brk, …) is highly likely to put the system into such a hang-like state. We discussed this issue here: https://github.com/ColinIanKing/stress-ng/pull/270

v2: mm: compaction: optimize compact_memory to comply with the admin-guide

For the /proc/sys/vm/compact_memory file, the admin-guide states: When 1 is written to the file, all zones are compacted such that free memory is available in contiguous blocks where possible. This can be important for example in the allocation of huge pages although processes will also directly compact memory as required

v4: mm/page_alloc: add some comments to explain the possible hole in __pageblock_pfn_to_page()

Now the __pageblock_pfn_to_page() is used by set_zone_contiguous(), which checks whether the given zone contains holes, and uses pfn_to_online_page() to validate if the start pfn is online and valid, as well as using pfn_valid() to validate the end pfn.

GIT PULL: ext4 changes for the 6.4 merge window

The following changes since commit e8d018dd0257f744ca50a729e3d042cf2ec9da65:

Linux 6.3-rc3 (2023-03-19 13:27:55 -0700)

are available in the Git repository at:

https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git tags/ext4_for_linus

v2: fs: multigrain timestamps

While I don’t think we can practically optimize away ctime updates like we do with i_version, I do like the idea of using this scheme to indicate when we need to use a high-res timestamp.

v4: of: fdt: Scan /memreserve/ last

Change the scanning /memreserve/ and /reserved-memory node order to fix Kernel panic on Khadas Vim3 Board.

If /memreserve/ goes first, the memory is reserved, but nomap can’t be applied to the region. So the memory won’t be used by Linux, but it is still present in the linear map as normal memory, which allows speculation. Legitimate access to adjacent pages will cause the CPU to end up prefetching into them leading to Kernel panic.

v1: string: use __builtin_memcpy() in strlcpy/strlcat

lib/string.c is built with -ffreestanding, which prevents the compiler from replacing certain functions with calls to their library versions.

v1: -v2: mm,unmap: avoid flushing TLB in batch if PTE is inaccessible

The version 1 of this patch was merged in mm-unstable branch. If you want to move that patch into mm-stable recently, it may be better to update that patch with this new version firstly. If you want to do that after v6.4-rc1, I will rebase this patch and resend it after v6.4-rc1 is released.

RFC: allow building a kernel without buffer_heads

After all the talk about removing buffer_heads, here is a series that shows how to build a kernel without buffer_heads. And how unrealistic it is to remove them entirely.

v1: mmzone: Introduce for_each_populated_zone_pgdat()

Instead of defining an index and determining if the zone has memory, introduce the for_each_populated_zone_pgdat() helper that can be used to iterate over each populated zone in pgdat, and convert the most obvious users to it.
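The shape of such a helper can be modeled in Python as a generator that hides both the index and the "has memory" test. This is a sketch, not the kernel macro: the `Zone` class is a stand-in, and "populated" is modeled as a nonzero `present_pages` count.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Zone:
    """Stand-in for a memory zone; only the fields this sketch needs."""
    name: str
    present_pages: int


def for_each_populated_zone_pgdat(zones: List[Zone]) -> Iterator[Zone]:
    """Yield only the zones of one node that actually contain memory."""
    for zone in zones:
        if zone.present_pages > 0:
            yield zone
```

Callers then write a single loop over the helper instead of an indexed loop with a `continue` for empty zones.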

File Systems

GIT PULL: iomap: new code for 6.4

Please pull this branch with changes for iomap for 6.4-rc1. The only changes for this cycle are the addition of tracepoints to the iomap directio code so that Ritesh (who is working on porting ext2 to iomap) can observe the io flows more easily. Dave will be sending you a pull request for xfs code for this cycle.

v1: Prepare for supporting more filesystems with fanotify

This is the second part of the proposal to support fanotify reporting file ids on overlayfs.

The first part [1] relaxes the requirements for filesystems to support reporting events with fid to require only the ->encode_fh() operation.

GIT PULL: sysctl changes for v6.4-rc1

Note: given that we save memory with each change that moves away from a deprecated call, I don’t see a need to immediately pause all kernel/sysctl.c moves. Each replacement of a deprecated call saves us memory, and likely more than the simple empty entry when we move a kernel/sysctl.c entry to its own file.

v1: inotify: Avoid reporting event with invalid wd

When inotify_freeing_mark() races with inotify_handle_inode_event() it can happen that inotify_handle_inode_event() sees that i_mark->wd got already reset to -1 and reports this value to userspace which can confuse the inotify listener. Avoid the problem by validating that wd is sensible (and pretend the mark got removed before the event got generated otherwise).
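The validation described above amounts to a guard before the event is queued. The following Python model is a sketch only; the function name and the dictionary layout of the event are hypothetical and do not mirror the kernel's data structures.

```python
from typing import Optional


def build_event(wd: int, mask: int, name: str) -> Optional[dict]:
    """Return an event for userspace, or None if the mark is being removed."""
    if wd == -1:
        # The watch descriptor was already reset by a concurrent removal.
        # Behave as if the removal happened before the event was generated
        # and report nothing, rather than exposing wd == -1 to the listener.
        return None
    return {"wd": wd, "mask": mask, "name": name}
```

From the listener's point of view a dropped event is indistinguishable from the watch having been removed slightly earlier, which is a valid ordering of the race.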

RFC: allow building a kernel without buffer_heads

After all the talk about removing buffer_heads, here is a series that shows how to build a kernel without buffer_heads. And how unrealistic it is to remove them entirely.

Most of the series refactors some common code to make implementing direct I/O easier without use of the ->direct_IO method and the helpers based around it. It then switches buffered writes (but not writeback) for block devices to use iomap unconditionally, but still using buffer_heads.

git pull: vfs.git misc pile

The following changes since commit eeac8ede17557680855031c6f305ece2378af326:

Linux 6.3-rc2 (2023-03-12 16:36:44 -0700)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-misc

for you to fetch changes up to 73bb5a9017b93093854c18eb7ca99c7061b16367:

fs: Fix description of vfs_tmpfile() (2023-03-12 20:03:48 -0400)

git pull: fget() whack-a-mole

The following changes since commit fe15c26ee26efa11741a7b632e9f23b01aca4cc6:

Linux 6.3-rc1 (2023-03-05 14:52:03 -0800)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git tags/pull-fd

for you to fetch changes up to 4a892c0fe4bb0546d68a89fa595bd22cb4be2576:

fuse_dev_ioctl(): switch to fdget() (2023-04-20 22:55:35 -0400)

fget() to fdget() conversions

v1: blk: optimization for classic polling

This removes the dependency on interrupts to wake up the task. While polling for IO completion, set the task state to TASK_RUNNING if need_resched() returns true. Earlier, the polling task used to sleep, relying on an interrupt to wake it up. This made some IO take very long when interrupt coalescing is enabled in NVMe.

Networking

v4: net: mvpp2: tai: add extts support

This patch series adds support for PTP event capture on the Armada 80x0/70x0. This feature is mainly used by tools such as Linux ts2phc(3) in order to synchronize a timestamping unit (like the mvpp2’s TAI) and a system DPLL on the same PCB.

v1: net: virtio-net: allow usage of small vrings

At the moment, if a virtio network device uses vrings with fewer than MAX_SKB_FRAGS + 2 entries, the device won’t be functional.

The following condition vq->num_free >= 2 + MAX_SKB_FRAGS will always evaluate to false, leading to TX timeouts.
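Why the condition can never hold on a small ring is easy to see in a toy model. This Python sketch is illustrative: the function name is hypothetical, and MAX_SKB_FRAGS is assumed to be 17, a common default; the real value depends on kernel configuration.

```python
# Assumed default for illustration; the kernel value is configuration dependent.
MAX_SKB_FRAGS = 17


def tx_can_ever_proceed(ring_size: int,
                        max_skb_frags: int = MAX_SKB_FRAGS) -> bool:
    """True if `num_free >= 2 + MAX_SKB_FRAGS` can hold for this ring at all."""
    # Best case: the ring is completely empty, so every descriptor is free.
    num_free_when_empty = ring_size
    return num_free_when_empty >= 2 + max_skb_frags
```

If even an empty ring fails the test, the transmit queue is never considered to have room, which matches the TX timeouts described above.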

v2: net: bonding: add xdp_features support

Introduce xdp_features support for bonding driver according to the slave devices attached to the master one. xdp_features is required whenever we want to xdp_redirect traffic into a bond device and then into selected slaves attached to it.

v3: virtio_net: suppress cpu stall when free_unused_bufs

For multi-queue and large ring-size use case, the following error occurred when free_unused_bufs: rcu: INFO: rcu_sched self-detected stall on CPU.

v1: net: atlantic: Define aq_pm_ops conditionally on CONFIG_PM

The only use of aq_pm_ops is conditional on CONFIG_PM. The definition of aq_pm_ops and its functions should also be conditional on CONFIG_PM.

v1: igb: Define igb_pm_ops conditionally on CONFIG_PM

The only use of igb_pm_ops is conditional on CONFIG_PM. The definition of igb_pm_ops should also be conditional on CONFIG_PM.

v4: bpf: Socket lookup BPF API from tc/xdp ingress does not respect VRF bindings.

When calling socket lookup from L2 (tc, xdp), VRF boundaries aren’t respected. This patchset fixes this by regarding the incoming device’s VRF attachment when performing the socket lookups from tc/xdp.

The first two patches are coding changes which factor out the tc helper’s logic which was shared with cg/sk_skb (which operate correctly).

v4: bpf-next: Introduce a new kfunc of bpf_task_under_cgroup

When tracing sched-related functions such as enqueue_task_fair, it is necessary to check whether a specified task, rather than the current task, is within a given cgroup.

v4: net-next: Wangxun netdev features support

Implement tx_csum and rx_csum to support hardware checksum offload. Implement ndo_vlan_rx_add_vid and ndo_vlan_rx_kill_vid. Enable macros in netdev features which wangxun can support.

v7: Create common DPLL configuration API

Implement a common API for clock/DPLL configuration and status reporting. The API uses the netlink interface as transport for commands and event notifications. This API aims to extend the current pin configuration and make it flexible and easy to cover special configurations.

v2: can: bxcan: add support for single peripheral configuration

The series adds support for managing bxCAN controllers in single peripheral configuration. Unlike stm32f4 SOCs, where bxCAN controllers are only in dual peripheral configuration, stm32f7 SOCs contain three CAN peripherals: CAN1 and CAN2 in dual peripheral configuration and CAN3 in single peripheral configuration.

v1: net-next: pds_core: add switchdev and tc for vlan offload

This is an RFC for adding to the pds_core driver some very simple support for VF representors and a tc command for offloading VF port vlans.

v1: net: add xdp_features support for bonding driver

Introduce missing xdp_features support for bonding driver. xdp_features is required whenever we want to xdp_redirect traffic into a bond device and then into selected slaves attached to it.

v1: net-next: net: tcp: make txhash use consistent for IPv4

Series is divided in two parts. First two commits make the txhash (used for the skb hash in TCP) to be consistent for all IPv4/TCP packets (IPv6 doesn’t have the same issue). Last two commits improve doc/comment hash-related parts.

v1: mISDN: Use list_count_nodes()

count_list_member() really looks the same as list_count_nodes(), so use the latter instead of hand writing it.

The first one returns an int and the other a size_t, but that should be fine. It is really unlikely that we get so many parties in a conference.
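What a hand-written counter and list_count_nodes() both do can be shown with a Python model of a circular, doubly linked list with a sentinel head. This is a sketch of the technique, not the kernel implementation; the class and function bodies here are written for illustration.

```python
class ListHead:
    """Minimal model of a circular, doubly linked list node."""

    def __init__(self) -> None:
        # An empty list is a head that points at itself in both directions.
        self.next = self
        self.prev = self


def list_add_tail(new: ListHead, head: ListHead) -> None:
    """Insert `new` just before `head`, i.e. at the end of the list."""
    new.prev = head.prev
    new.next = head
    head.prev.next = new
    head.prev = new


def list_count_nodes(head: ListHead) -> int:
    """Count entries by walking from head.next until we return to head."""
    count = 0
    pos = head.next
    while pos is not head:
        count += 1
        pos = pos.next
    return count
```

Because the walk only follows `next` pointers, it does not matter whether the original helper iterated over entries or over raw nodes; both visit each element exactly once.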

v1: net: ice: block LAN in case of VF to VF offload

VF to VF traffic shouldn’t go outside. To enforce it, set only the loopback enable bit in case of all ingress type rules added via the tc tool.

v4: net-next: virtio_net: refactor xdp codes

Due to historical reasons, the implementation of XDP in virtio-net is relatively chaotic. For example, the processing of XDP actions has two copies of similar code, such as the page and xdp_page processing.

v1: leds: introduce new LED hw control APIs

This is a continuation of [1]. It was decided to take a more gradual approach to implementing LED support for switches and PHYs, starting with basic support and then implementing the hw control part once all the prerequisites are done.

v1: net-next: wifi: ath10k: Use list_count_nodes()

ath10k_wmi_fw_stats_num_peers() and ath10k_wmi_fw_stats_num_vdevs() really look the same as list_count_nodes(), so use the latter instead of hand writing it.

v1: net-next: wifi: ath11k: Use list_count_nodes()

ath11k_wmi_fw_stats_num_vdevs() and ath11k_wmi_fw_stats_num_bcn() really look the same as list_count_nodes(), so use the latter instead of hand writing it.

The first ones use list_for_each_entry() and the other list_for_each(), but they both count the number of nodes in the list.

v2: net: dsa: mv88e6xxx: add mv88e6321 rsvd2cpu

Add rsvd2cpu capability for mv88e6321 model, to allow proper bpdu processing.

v1: net-next: wifi: mwifiex: Use list_count_nodes()

mwifiex_wmm_list_len() is the same as list_count_nodes(), so use the latter instead of hand writing it.

Turn ‘ba_stream_num’ and ‘ba_stream_max’ into size_t to keep the same type as what is returned by list_count_nodes().

v4: New NDO methods ndo_hwtstamp_get/set

Your patch series works on my side with the macb MAC controller and this patch. I don’t know if you are waiting for more reviews but it seems good enough to drop the RFC tag.

v3: net: net/sched: act_mirred: Add carrier check

As you can see, it’s administratively UP but operationally down. In this case, sending a packet to this port caused a nasty kernel hang (so nasty that we were unable to capture it). Aborting a transmit based on operational status (in addition to administrative status) fixes the issue.

GIT PULL: Networking for 6.4

We have a few conflicts with your current tree, specifically:

  • between commits:

    dbb0ea153401 (“thermal: Use thermal_zone_device_type() accessor”)

the latter removed the code updated by the former, the resolution is deleting mlxsw_thermal_module_trips_reset() and mlxsw_thermal_module_trips_update().

v1: net-next: add driver support for Microchip LAN865X Rev.B0 Internal PHYs

The first patch updates the LAN867x PHY supported revision number to Rev.B1 and the second patch adds the support for Microchip LAN865X Rev.B0 10BASE-T1S Internal PHYs.

v2: net-next: Add support for VSC8531_02 PHY and DT RGMII tuning

Add support for VSC8531_02 PHY ID. Also provide an option to tune RGMII delay value via devicetree. The default delays are retained in the driver.

v1: bpf-next: net/smc: Introduce BPF injection capability

These patches attempt to introduce BPF injection capability for SMC, and add a selftest to ensure code stability.

As we all know, the SMC protocol is not suitable for all scenarios, especially short-lived ones. However, most applications cannot guarantee that such scenarios never occur. Therefore, apps may need specific strategies to decide whether to use SMC or not; for example, apps can limit the scope of SMC to a specific IP address or port.

v2: net/ncsi: clear Tx enable mode when handling a Config required AEN

ncsi_channel_is_tx() determines whether a given channel should be used for Tx or not. However, when reconfiguring the channel by handling a Configuration Required AEN, there is a misjudgment that the channel Tx has already been enabled, which results in the Enable Channel Network Tx command not being sent.

v1: net: phy: aquantia: Add 10mbps support

This adds support for the 10mbps speed to the PHY device’s “supported” field, which helps autonegotiate a 10mbps link from the PHY side in cases where the PHY supports the speed but it is not reflected in the kernel PHY framework.

One such example is AQR113C PHY.

v2: net-next: net: phy: hide the PHYLIB_LEDS knob

commit 4bb7aac70b5d (“net: phy: fix circular LEDS_CLASS dependencies”) solved a build failure, but introduces a new config knob with a default ‘y’ value: PHYLIB_LEDS.

Security Enhancements

GIT PULL: flexible-array transformations for 6.4-rc1

The following changes since commit fe15c26ee26efa11741a7b632e9f23b01aca4cc6:

Linux 6.3-rc1 (2023-03-05 14:52:03 -0800)

are available in the Git repository at:

git://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux.git tags/flex-array-transformations-6.4-rc1

for you to fetch changes up to 00168b415a60cec7558608efb4fc50f2a73daae2:

Asynchronous I/O

v3: io_uring: Pass the whole sqe to commands

These three patches prepare for the sock support in the io_uring cmd, as described in the following RFC:

https://lore.kernel.org/lkml/20230406144330.1932798-1-leitao@debian.org/

Since the support linked above depends on other refactors, such as the sock ioctl() sock refactor[1], I would like to start integrating patches that have consensus and can bring value right now. This will also reduce the patchset size later.

v1: Rethinking splice

IORING_OP_SPLICE has problems, many of them fundamental and rooted in the uapi design; see the patch 8 description. This patchset introduces a different approach, which came from discussions about splices and fused commands and absorbed ideas from both. We remove the reliance on pipes and instead register “spliced” buffers with data as io_uring registered buffers. The user can then use such a buffer as a usual registered buffer, e.g. pass it to IORING_OP_WRITE_FIXED.

v1: io_uring attached nvme queue

Also, io_uring ring is not to be shared among application threads. Application is responsible for building the sharing (if it feels the need). This means ring-associated exclusive queue can do away with some synchronization costs that occur for shared queue.

v11: io_uring: add napi busy polling support

This adds napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure, which holds the napi_id’s that are currently enabled for busy polling. For faster access it also adds a hash table.

v1: io_uring: Add io_uring_setup flag to pre-register ring fd and never install it

With IORING_REGISTER_USE_REGISTERED_RING, an application can register the ring fd and use it via registered index rather than installed fd. This allows using a registered ring for everything except the initial mmap.

v10: io_uring: add napi busy polling support

This adds napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure, which holds the napi_id’s that are currently enabled for busy polling. For faster access it also adds a hash table.

v9: liburing: add api for napi busy poll

This adds two new api’s to set/clear the napi busy poll settings. The two new functions are called:

  • io_uring_register_napi
  • io_uring_unregister_napi

The patch series also contains the documentation for the two new functions and two example programs. The client program is called napi-busy-poll-client and the server program napi-busy-poll-server. The client measures the roundtrip times of requests.

Rust For Linux

v3: rust: helpers: sort includes alphabetically in rust/helpers.c

Sort the #include directives of rust/helpers.c alphabetically and add a comment specifying this. The reason for this is to improve readability and to be consistent with the other files with a similar approach within ‘rust/’.

v1: rust: Sort rust/helpers.c’s #include directives

Sort the #include directives of rust/helpers.c alphabetically and add a comment specifying this.

BPF

v3: bpf-next: Handle immediate reuse in bpf memory allocator

As discussed in v1, freed objects in the bpf memory allocator may currently be reused immediately by a new allocation. This introduces a use-after-bpf-ma-free problem for non-preallocated hash maps and makes the lookup procedure return incorrect results. The immediate reuse also makes introducing new use cases more difficult (e.g. qp-trie).

v1: bpf-next: libbpf: capability for resizing datasec maps

The thought behind this is to allow for use cases where a given datasec needs to scale to, for example, the number of CPUs present. A bpf program can have a global array in a custom data section with an initial length, and before loading the bpf program, the array length could be extended to match the CPU count. The selftests included in this series perform this scaling to an arbitrary value to demonstrate how it can work.

v1: x86/pie: Make kernel image’s virtual address flexible

These patches make the changes necessary to build the kernel as Position Independent Executable (PIE) on x86_64. A PIE kernel can be relocated below the top 2G of the virtual address space. And this patchset provides an example to allow kernel image to be relocated in top 512G of the address space.

v1: bpf-next: selftests/bpf: Add fexit_sleep to DENYLIST.aarch64

It is reported that the fexit_sleep never returns in aarch64. The remaining tests cannot start. Put this test into DENYLIST.aarch64 for now so that other tests can continue to run in the CI.

v2: bpf-next: libbpf: btf_dump_type_data_check_overflow needs to consider BTF_MEMBER_BITFIELD_SIZE

The reason is in btf_dump_type_data_check_overflow(). It does not use BTF_MEMBER_BITFIELD_SIZE from the struct’s member (btf_member). Instead, it is using the enum size which is 4. It had been working till the recent commit 4e04143c869c (“fs_context: drop the unused lsm_flags member”) removed an integer member which also removed the 4 bytes padding at the end of the fs_context. Missing this 4 bytes padding exposed this bug. In particular, when btf_dump_type_data_check_overflow() reaches the member ‘phase’, -E2BIG is returned.

v2: bpf-next: selftests/bpf: test_progs can read test lists from file

BPF selftests have ALLOWLIST and DENYLIST files, used to control which tests are run in CI. These files are currently parsed by a shell script. [1]

This patchset allows those files to be specified directly on the test_progs command line (e.g., as -a @ALLOWLIST).
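As a rough illustration of what reading such a list involves, the C sketch below recognises an "@file" argument and cleans up one line of a list file. The helper names are hypothetical, and the assumed line format (one test name per line, with optional '#' comments) is an assumption rather than something taken from the patchset.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns the file path if arg has the form "@path", otherwise NULL. */
static const char *list_file_path(const char *arg)
{
    assert(arg != NULL);
    return (arg[0] == '@' && arg[1] != '\0') ? arg + 1 : NULL;
}

/*
 * Cleans one line in place: drops a trailing '#' comment and any
 * leading or trailing whitespace. Returns the test name, or NULL
 * when nothing is left on the line.
 */
static char *parse_list_line(char *line)
{
    char *hash;
    size_t len;

    assert(line != NULL);

    hash = strchr(line, '#');
    if (hash)
        *hash = '\0';

    while (isspace((unsigned char)*line))
        line++;

    len = strlen(line);
    while (len > 0 && isspace((unsigned char)line[len - 1]))
        line[--len] = '\0';

    return len ? line : NULL;
}
```

A caller would open the path returned by list_file_path(), read it line by line, and add every non-NULL result of parse_list_line() to the allow or deny list.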

v2: bpf-next: bpf: Don’t EFAULT for {g,s}setsockopt with wrong optlen

optval larger than PAGE_SIZE leads to EFAULT if the BPF program isn’t careful enough. This is often overlooked and might break completely unrelated socket options. Instead of EFAULT, let’s ignore BPF program buffer changes. See the first patch for more info.

v1: selftests/bpf: Do not use sign-file as testcase

The sign-file utility (from scripts/) is used in prog_tests/verify_pkcs7_sig.c, but the utility should not be called as a test. Executing this utility produces the following error:

selftests: /linux/tools/testing/selftests/bpf: urandom_read
ok 16 selftests: /linux/tools/testing/selftests/bpf: urandom_read

selftests: /linux/tools/testing/selftests/bpf: sign-file
not ok 17 selftests: /linux/tools/testing/selftests/bpf: sign-file # exit=2

v4: bpf-next: bpftool: Dump map id instead of value for map_of_maps types

When using bpftool map dump with map_of_maps, it is usually more convenient to show the inner map id instead of raw value.

We are changing the plain print behavior to show inner_map_id instead of the hex value, which helps with quick lookup of the inner map via bpftool map dump id <inner_map_id>. To avoid disrupting scripted behavior, a new inner_map_id field is added to the JSON output instead of replacing value.

v2: bpf-next: bpf: Make bpf_helper_defs.h c++ friendly

Compiling C++ BPF programs with existing bpf_helper_defs.h is not possible due to stricter C++ type conversions. C++ complains about (void *) type conversions:

$ clang++ --include linux/types.h ./tools/lib/bpf/bpf_helper_defs.h
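The C sketch below shows the general kind of declaration involved and one way to write it so that both C and C++ accept it. The helper name and ID are hypothetical, and the exact form used in the patch may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper ID, standing in for a real BPF helper number. */
#define DEMO_HELPER_ID 1

/*
 * C-only style, relying on the implicit conversion from void * to a
 * function pointer, which C++ rejects:
 *
 *   static long (*demo_helper)(int arg) = (void *) DEMO_HELPER_ID;
 */

/* Portable style: cast directly to the function pointer type. */
static long (*const demo_helper)(int arg) = (long (*)(int)) DEMO_HELPER_ID;

/* Returns the numeric value stored in the pointer, for inspection only. */
static uintptr_t demo_helper_id(void)
{
    return (uintptr_t) demo_helper;
}
```

The pointer is never called here; in a real BPF program the value is only meaningful to the BPF toolchain and verifier, so this sketch merely checks that the declaration compiles and holds the expected ID.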

v1: bpf-next: Add precision propagation for subprogs and callbacks

As more and more real-world BPF programs become more complex and increasingly use subprograms (both static and global), scalar precision tracking and its (previously weak) support for BPF subprograms (and callbacks as a special case of that) is becoming more and more of an issue and limitation. Couple that with increasing reliance on state equivalence (BPF open-coded iterators have a hard requirement for state equivalence to converge and successfully validate loops), and it becomes pretty critical to address this limitation and make precision tracking universally supported for BPF programs of any complexity and composition.

v1: KEYS: Introduce user mode key and signature parsers

Support new key and signature formats with the same kernel component.

Verify the authenticity of system data with newly supported data formats.

Mitigate the risk of parsing arbitrary data in the kernel.

v7: vhost: virtio core prepares for AF_XDP

Currently, virtio may not work with the DMA APIs when the virtio features do not include VIRTIO_F_ACCESS_PLATFORM.

  1. I tried to let the DMA APIs return the physical address for a virtio device. But the DMA APIs only work with “real” devices.
  2. I tried to let xsk support callbacks to get the physical address from the virtio-net driver as the DMA address. But the maintainers of xsk may want to use dma-buf to replace the DMA APIs. I think that may be a larger effort, and we would wait too long.

v2: powerpc/bpf: populate extable entries only during the last pass

Since commit 85e031154c7c (“powerpc/bpf: Perform complete extra passes to update addresses”), two additional passes are performed to avoid space and CPU time wastage on powerpc. But these extra passes led to WARN_ON_ONCE() hits in bpf_add_extable_entry() as extable entries are populated again, during the extra pass, without resetting the index. Fix it by resetting entry index before repopulating extable entries, if and when there is an additional pass.
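The failure mode and the fix described above can be modelled with a short C sketch. All names here are hypothetical stand-ins, not the powerpc JIT code: the table has a fixed number of slots, and a second pass that keeps appending without resetting the index runs past the end.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_EXTABLE_MAX 4

/* Hypothetical, simplified JIT context. */
struct demo_jit_ctx {
    size_t exentry_idx;              /* next free extable slot */
    size_t num_exentries;            /* number of slots allocated */
    int extable[DEMO_EXTABLE_MAX];   /* recorded instruction offsets */
};

/* Returns 0 on success, -1 when the table is already full. */
static int demo_add_extable_entry(struct demo_jit_ctx *ctx, int insn_off)
{
    assert(ctx->num_exentries <= DEMO_EXTABLE_MAX);

    if (ctx->exentry_idx >= ctx->num_exentries)
        return -1;   /* models the condition that triggers the warning */

    ctx->extable[ctx->exentry_idx++] = insn_off;
    return 0;
}

/* Runs one code-generation pass over n instructions that need entries. */
static int demo_pass(struct demo_jit_ctx *ctx, const int *offs, size_t n,
                     int reset_index)
{
    size_t i;

    if (reset_index)
        ctx->exentry_idx = 0;   /* the fix: start over before repopulating */

    for (i = 0; i < n; i++)
        if (demo_add_extable_entry(ctx, offs[i]))
            return -1;

    return 0;
}
```

Running demo_pass() twice without the reset fails on the second pass, while passing reset_index = 1 lets the extra pass overwrite the same slots.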

v1: bpf-next: selftests/bpf: avoid mark_all_scalars_precise() trigger in one of iter tests

For now, change the test to assume fixed size of passed in array. Once BPF verifier supports precision tracking across subprogram calls, these changes will be reverted as unnecessary.

[RFC/PATCH bpf-next 00/20] bpf: Add multi uprobe link

This patchset adds support for attaching multiple uprobes and USDT probes through a new uprobe_multi link.

The current uprobe is attached through the perf event and attaching many uprobes takes a lot of time because of that.

v6: tracing: Add fprobe events

This is the 6th version of the series to improve fprobe and add basic fprobe event support for ftrace (tracefs) and perf. The previous version is here:

https://lore.kernel.org/all/168198993129.1795549.8306571027057356176.stgit@mhiramat.roam.corp.google.com/

v2: libbpf: Improve version handling when attaching uprobe

This change fixes the handling of versions in elf_find_func_offset. In the previous implementation, we incorrectly assumed that the version information would be present in the string found in the string table.

v1: bpf-next: xsk: Use pool->dma_pages to check for DMA

Compare pool->dma_pages instead of pool->dma_pages_cnt to check for an active DMA mapping. pool->dma_pages needs to be read anyway to access the map, so this compiles to more efficient code.
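A minimal C sketch of the idea follows. The struct is a hypothetical stand-in for the real pool, unsigned long stands in for the kernel's DMA address type, and the sketch assumes the pointer is NULL whenever no DMA mapping exists.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced version of a buffer pool. */
struct demo_pool {
    unsigned long *dma_pages;    /* assumed NULL when nothing is mapped */
    unsigned int dma_pages_cnt;  /* number of entries in dma_pages */
};

/*
 * Returns the DMA address recorded for page idx, or 0 when the pool
 * has no DMA mapping. The pointer is loaded once and serves both as
 * the "is DMA active" test and as the base for the lookup.
 */
static unsigned long demo_pool_dma_addr(const struct demo_pool *pool,
                                        unsigned int idx)
{
    unsigned long *pages;

    assert(pool != NULL);

    pages = pool->dma_pages;
    if (!pages)
        return 0;

    assert(idx < pool->dma_pages_cnt);
    return pages[idx];
}
```

Testing the count instead would require loading a second field before the pointer is read, which is the extra work the change avoids.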

Related Technology Updates

QEMU

v1: target/riscv: RVV 1-fill tail element changes

This series makes changes in vext_set_tail_elements_1s() to be a little nicer to the emulation.

The first patch makes the function a no-op when vta == 0. Besides simplifying the logic, this also gives a small performance boost.
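The early return can be sketched in C as below. This is a simplified, hypothetical stand-in for the QEMU helper, not its actual signature: it treats the destination register as a plain byte array and fills the tail with 1s only when the tail-agnostic bit is set.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical tail-fill helper: bytes in [used_bytes, total_bytes)
 * are set to all 1s when vta is non-zero (tail-agnostic). With
 * vta == 0 (tail-undisturbed) the function does nothing.
 */
static void demo_set_tail_1s(uint8_t *vd, uint32_t vta,
                             size_t used_bytes, size_t total_bytes)
{
    assert(used_bytes <= total_bytes);

    if (vta == 0)
        return;   /* nothing to fill, skip all further work */

    memset(vd + used_bytes, 0xff, total_bytes - used_bytes);
}
```

Checking vta first means the tail-undisturbed case skips any size computation and the memset call entirely.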

v2: hw/riscv: virt: Assume M-mode FW in pflash0 only when “-bios none”

Currently, the virt machine supports two pflash instances, each 32MB in size. However, the first pflash is always assumed to contain M-mode firmware, and the reset vector is set to it if it is enabled. Hence, for S-mode payloads like EDK2, only one pflash instance is available for use, which means both the code and the NV variables of EDK2 must share the same pflash.

v3: hw/riscv/virt: Add a second UART for secure world

The virt machine can have two UARTs, and the second UART can be used by the secure payload, firmware or OS residing in the secure world. The UART device will be added to the FDT in a separate patch.

v1: Add RISC-V KVM AIA Support

This series introduces support for KVM AIA in the RISC-V architecture. The implementation refers to Anup’s KVM AIA implementation in kvmtool (https://github.com/avpatel/kvmtool.git). To test these patches, a Linux kernel with KVM AIA support is required, which can be found in the qemu_kvm_aia branch at https://github.com/yong-xuan/linux.git. This kernel branch is based on the riscv_aia_v1 branch from https://github.com/avpatel/linux.git and includes two additional patches.

U-Boot

v3: Add ethernet driver for StarFive JH7110 SoC

This series of patches is based on the latest master branch and adds ethernet support for the StarFive JH7110 RISC-V SoC. The series includes EEPROM, PHY and MAC drivers. The PHY model is YT8531 (from Motorcomm Inc), and the MAC version is dwmac-5.20 (from Synopsys DesignWare).

v5: Add StarFive JH7110 PCIe driver support

The PCIe driver depends on the gpio, pinctrl, clk and reset drivers for initialization. The PCIe dts configuration includes all these settings.


