泰晓科技 -- Focused on Linux: trace things to their source, see the big picture in small details!
Website: https://tinylab.org

RISC-V Linux Kernel and Related Technology News, Issue 46

Written by 呀呀呀 on 2023/05/22

Date: 20230521
Editor: 晓依
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support

v1: tools/nolibc: autodetect stackprotector availability from compiler

As suggested by Willy it is possible to detect the availability of stackprotector via preprocessor defines. Make use of that to simplify the code and interface of nolibc.
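
As a rough illustration of the detection idea (not the nolibc code itself): GCC and Clang predefine `__SSP__`, `__SSP_STRONG__`, or `__SSP_ALL__` when one of the `-fstack-protector*` options is active, so a header can test for them without extra build-system plumbing. The macro `HAVE_STACKPROTECTOR` and the function name below are hypothetical.

```c
/* Hypothetical feature macro; the compiler-provided __SSP*__ macros are real. */
#if defined(__SSP__) || defined(__SSP_STRONG__) || defined(__SSP_ALL__)
#define HAVE_STACKPROTECTOR 1
#else
#define HAVE_STACKPROTECTOR 0
#endif

/* Returns 1 when this translation unit was built with a stack protector. */
static int stackprotector_enabled(void)
{
	return HAVE_STACKPROTECTOR;
}
```

Building the same file with and without `-fstack-protector-strong` should flip the return value, which is the property that lets the interface drop an explicit opt-in define.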

v1: RISC-V: KVM: Redirect AMO load/store misaligned traps to guest

M-mode redirects an unhandled misaligned trap back to S-mode when the trap is not delegated to VS-mode (hedeleg). However, KVM running in HS-mode terminates the VS-mode software when control returns from M-mode. KVM should instead redirect the trap back to VS-mode and let the VS-mode trap handler decide the next step. This patch provides a way to handle misaligned traps in KVM, rather than only redirecting them to VS-mode or terminating the guest.

v1: perf parse-regs: Refactor arch related functions

Register parsing has two levels: one under the ‘arch’ folder and another under the ‘util’ folder. In a good design, the ‘arch’ folder handles architecture-specific operations and provides APIs for the upper layer, while the ‘util’ folder stays general and simply calls those APIs to talk to the arch layer.

v1: riscv: hibernation: Replace jalr with jr before suspend_restore_regs

There is no need to link the x1/ra register via jalr before suspend_restore_regs, so it is better to replace jalr with jr.

v2: Add Sipeed Lichee Pi 4A RISC-V board support

Sipeed’s Lichee Pi 4A development board uses Lichee Module 4A core module which is powered by T-HEAD’s TH1520 SoC. Add minimal device tree files for the core module and the development board.

v1: riscv: Allow disable vdso support

This is part of my tinylinux work for RISC-V, see related patchsets:

  • RISC-V: Enable dead code elimination, v3 [1]
  • tools/nolibc: riscv: Fix up compile error for rv32, v1 [2]
  • Add dead syscalls elimination support, RFC [3]

v20: -next: riscv: Add vector ISA support

This patchset is implemented based on the vector 1.0 spec to add vector support to the riscv Linux kernel. There are some assumptions for this implementation.

v4: riscv: add Bouffalolab bl808 support

This series adds Bouffalolab uart driver and basic devicetrees for Bouffalolab bl808 SoC and Sipeed M1s dock board.

v1: riscv: s64ilp32: Running 32-bit Linux kernel on 64-bit supervisor mode

This patch series adds s64ilp32 support to riscv. The term s64ilp32 means smode-xlen=64 and -mabi=ilp32 (ints, longs, and pointers are all 32-bit), i.e., running a 32-bit Linux kernel in pure 64-bit supervisor mode. Many 64ilp32 ABIs already exist, such as mips-n32 [1], arm-aarch64ilp32 [2], and x86-x32 [3], but they are all about userspace. Thus, this should be the first time a 32-bit Linux kernel runs with the 64ilp32 ABI in supervisor mode (if not, correct me).

v18: Microchip Soft IP corePWM driver

Another version, although a lot smaller of a range-diff than previously! All you get this time is the one change requested by Uwe on v17, along with a rebase on -rc1.

v6: Add JH7110 USB and USB PHY driver support

This patchset adds a USB driver and a USB PHY driver for the StarFive JH7110 SoC. USB works in peripheral mode and uses the USB 2.0 PHY on the VisionFive 2 board. The patch has been tested on the VisionFive 2 board.

v6: Add STG/ISP/VOUT clock and reset drivers for StarFive JH7110

This patch series is based on the basic JH7110 SYSCRG/AONCRG drivers and adds new partial clock drivers and reset support for the System-Top-Group (STG), Image-Signal-Process (ISP) and Video-Output (VOUT) of the StarFive JH7110 RISC-V SoC. These clocks and resets can be used by the DMA, VIN and Display modules.

v1: dt-bindings: riscv: deprecate riscv,isa

When the RISC-V dt-bindings were accepted upstream in Linux, the base ISA etc had yet to be ratified. By the ratification of the base ISA, incompatible changes had snuck into the specifications - for example the Zicsr and Zifencei extensions were spun out of the base ISA.

v1: RISC-V KVM in-kernel AIA irqchip

This series adds an in-kernel AIA irqchip which only trap-n-emulates IMSIC and APLIC MSI-mode for the Guest. The APLIC MSI-mode trap-n-emulate is optional, so KVM user space can emulate APLIC entirely in user space.

v3: RISC-V: Enable dead code elimination

Select CONFIG_HAVE_LD_DEAD_CODE_DATA_ELIMINATION for RISC-V, allowing the user to enable dead code elimination. In order for this to work, ensure that we keep the alternative table by annotating them with KEEP.

v3: perf vendor events riscv: add T-HEAD C9xx JSON file

These events are the maximum set that the C9xx series supports. Since T-HEAD lets manufacturers decide whether events are usable, the final support of the perf events is determined by the pmu node of the SoC dtb.

v1: irq_work: consolidate arch_irq_work_raise prototypes

The prototype was hidden on x86, which causes a warning:

kernel/irq_work.c:72:13: error: no previous prototype for ‘arch_irq_work_raise’ [-Werror=missing-prototypes]

Fix this by providing it in only one place that is always visible.

v1: perf: add T-HEAD C9xx series cpu support

The T-HEAD C9xx series is a family of riscv CPU IP. Because this IP was proposed before the current riscv event standard, it has a non-standard encoding for perf events and does not implement the MARCH and MIMP CSRs. This patch adds these events to support C9xx CPUs.

Process Scheduling

v1: RESEND: sched/nohz: Add HRTICK_BW for using cfs bandwidth with nohz_full

CFS bandwidth limits and NOHZ full don’t play well together. Tasks can easily run well past their quotas before a remote tick does accounting. This leads to long, multi-period stalls before such tasks can run again. Use the hrtick mechanism to set a sched tick to fire at remaining_runtime in the future if we are on a nohz full cpu, if the task has quota and if we are likely to disable the tick (nr_running == 1). This allows for bandwidth accounting before tasks go too far over quota.

v1: sched: core: Simplify cpuset_cpumask_can_shrink()

Remove useless intermediate variable “ret” and its initialization. Directly return dl_cpuset_cpumask_can_shrink() result.

v1: sched/rt: Print curr when RT throttling activated

We may hit an issue where one RT thread occupies the CPU for 950ms out of every 1s. The RT thread may be a business thread or some other unknown thread.

Currently, the kernel only prints “sched: RT throttling activated” when RT throttling happens, which makes it hard to tell which RT thread is responsible. For further analysis, more information needs to be printed.

v1: sched/fair: Introduce SIS_PAIR to wakeup task on local idle core first

The will-it-scale context_switch1 test case exposes the issue. The test platform has 2 x 56C/112T and 224 CPUs in total. To evaluate the C2C overhead within 1 LLC, will-it-scale was tested with 1 socket/node online, so there are 56C/112T CPUs when running will-it-scale.

v3: sched: Consider CPU contention in frequency, EAS max util & load-balance busiest CPU selection

This is the implementation of the idea to factor in CPU runnable_avg into the CPU utilization getter functions (so called ‘runnable boosting’) as a way to consider CPU contention for:

(a) CPU frequency
(b) EAS’ max util and
(c) ‘migrate_util’ type load-balance busiest CPU selection.

v1: sched/fair: Consider asymmetric scheduler groups in load balancer

The current load balancer implementation implies that scheduler groups, within the same scheduler domain, all host the same number of CPUs.

This appears to be valid for non-s390 architectures. Nevertheless, s390 can actually have scheduler groups of unequal size. The current scheduler behavior causes some s390 configs to use SMT while some cores are still idle, leading to a performance degradation under certain levels of workload.

GIT PULL: sched/urgent for v6.4-rc2

please pull an urgent (oh well :)) sched fix for 6.4.

Thx.

Memory Management

v21: splice: Kill ITER_PIPE

I’ve split off splice patchset and moved the block patches to a separate branch (though they are dependent on this one).

This patchset kills off ITER_PIPE to avoid a race between truncate, iov_iter_revert() on the pipe and an as-yet incomplete DMA to a bio with unpinned/unref’ed pages from an O_DIRECT splice read. This causes memory corruption[2]. Instead, we use filemap_splice_read(), which invokes the buffered file reading code and splices from the pagecache into the pipe; copy_splice_read(), which bulk-allocates a buffer, reads into it and then pushes the filled pages into the pipe; or handle it in filesystem-specific code.

v2: change ->index to PAGE_SIZE for hugetlb pages

This patchset adds new wrappers for hugetlb code to interact with the page cache. These wrappers calculate a linear page index, as this is now what the page cache expects for hugetlb pages as well.

v2: Optimize mremap during mutual alignment within PMD

Here is v2 of the mremap start address optimization / fix for exec warning.

  1. Fix issue with bogus return value found by Linus if we broke out of the above loop for the first PMD itself.

v1: mm: compaction: avoid GFP_NOFS ABBA deadlock

During stress testing with higher-order allocations, a deadlock scenario was observed in compaction: One GFP_NOFS allocation was sleeping on mm/compaction.c::too_many_isolated(), while all CPUs in the system were busy with compactors spinning on buffer locks held by the sleeping GFP_NOFS allocation.

v4: memblock: Add flags and nid info in memblock debugfs

Currently, the memblock debugfs can display the count of memblock_type and the base and end of the reg. However, when memblock_mark_*() or memblock_set_node() is executed on some range, the information in the existing debugfs cannot make it clear why the address is not consecutive.

v1: mm,page_owner: mark page_owner_threshold helpers as static

The newly added functions have no prototype:

mm/page_owner.c:748:5: error: no previous prototype for ‘page_owner_threshold_get’ [-Werror=missing-prototypes]
mm/page_owner.c:754:5: error: no previous prototype for ‘page_owner_threshold_set’ [-Werror=missing-prototypes]

v1: iov_iter: Add automatic-alloc for ITER_BVEC and use in direct_splice_read()

If it’s a problem that direct_splice_read() always allocates as much memory as is asked for and that will fit into the pipe when less could be allocated in the case that, say, an O_DIRECT-read will hit a hole and do a short read or a socket will return less than was asked for, something like the attached modification to ITER_BVEC could be made.

v4: mm, dma, arm64: Reduce ARCH_KMALLOC_MINALIGN to 8

That’s the fourth version of the series reducing the kmalloc() minimum alignment on arm64 to 8 (from 128).

The first 10 patches decouple ARCH_KMALLOC_MINALIGN from ARCH_DMA_MINALIGN and, for arm64, it limits the kmalloc() caches to those aligned to the run-time probed cache_line_size(). The advantage on arm64 is that we gain the kmalloc-{64,192} caches.

v1: mm: page_alloc: set sysctl_lowmem_reserve_ratio storage-class-specifier to static

smatch reports:

mm/page_alloc.c:247:5: warning: symbol ‘sysctl_lowmem_reserve_ratio’ was not declared. Should it be static?

This variable is only used in its defining file, so it should be static.

v1: mm/page_owner: set page_owner_* storage-class-specifier to static

smatch reports:

mm/page_owner.c:739:30: warning: symbol ‘page_owner_stack_operations’ was not declared. Should it be static?
mm/page_owner.c:748:5: warning: symbol ‘page_owner_threshold_get’ was not declared. Should it be static?
mm/page_owner.c:754:5: warning: symbol ‘page_owner_threshold_set’ was not declared. Should it be static?

v9: net-next: splice, net: Replace sendpage with sendmsg(MSG_SPLICE_PAGES), part 1

Here’s the first tranche of patches towards providing a MSG_SPLICE_PAGES internal sendmsg flag that is intended to replace the ->sendpage() op with calls to sendmsg(). MSG_SPLICE_PAGES is a hint that tells the protocol that it should splice the pages supplied if it can and copy them if not.

File Systems

v1: Create large folios in iomap buffered write path

Wang Yugui has a workload which would be improved by using large folios. Until now, we’ve only created large folios in the readahead path, but this workload writes without reading. The decision of what size folio to create is based purely on the size of the write() call (unlike readahead where we keep history and can choose to create larger folios based on that history even if individual reads are small).

v1: cachefiles: Allow the cache to be non-root

Set mode 0600 on files in the cache so that cachefilesd can run as an unprivileged user rather than leaving the files all with 0. Directories are already set to 0700.

v2: bpf-next: Add O_PATH-based BPF_OBJ_PIN and BPF_OBJ_GET support

Add the ability to specify the pinning location within BPF FS using O_PATH-based FDs, similar to the openat() family of APIs. Patch #1 adds the necessary kernel-side changes. Patch #2 exposes this through libbpf APIs. Patch #3 uses the new mount APIs (fsopen, fsconfig, fsmount) to demonstrate how it is now possible to work with detach-mounted BPF FS using the new BPF_OBJ_PIN and BPF_OBJ_GET functionality.

v2: Documentation: add initial iomap kdoc

To help with iomap adoption / porting, I set out to improve the iomap documentation and to get general guidance for filesystem conversions over from buffer-head in time for this year’s LSFMM. The end result, thanks to the review of Darrick, Christoph and others, is on the kernelnewbies wiki [0].

v1: squashfs: don’t include buffer_head.h

Squashfs has stopped using buffer heads in 93e72b3c612adcaca1 (“squashfs: migrate from ll_rw_block usage to BIO”).

v1: gfs2/buffer folio changes

This kind of started off as a gfs2 patch series, then became entwined with buffer heads once I realised that gfs2 was the only remaining caller of __block_write_full_page(). For those not in the gfs2 world, the big point of this series is that block_write_full_page() should now handle large folios correctly.

v4: memcontrol: support cgroup level OOM protection

Establish a new OOM score algorithm that supports a cgroup-level OOM protection mechanism. When a global/memcg OOM event occurs, we treat all processes in the cgroup as a whole, and the OOM killer needs to select the process to kill based on the protection quota of the cgroup.

v1: ACPI: APEI: EINJ: Add support for vendor defined error types

Noted. The only checkpatch warning that was ignored was pertaining to the usage of S_IWUSR macro with debugfs_create_blob. Had noticed that a majority of einj module's debugfs files have been created with S_IRUSR and S_IWUSR macros. So used them to maintain uniformity.
Will switch to octal permissions though.

v1: procfs: consolidate arch_report_meminfo declaration

The arch_report_meminfo() function is provided by four architectures, with a __weak fallback in procfs itself. On architectures that don’t have a custom version, the __weak version causes a warning because of the missing prototype.

v1: radix-tree: move declarations to header

The xarray.c file contains the only call to radix_tree_node_rcu_free(), and it comes with its own extern declaration for it. This means the function definition causes a missing-prototype warning:

lib/radix-tree.c:288:6: error: no previous prototype for ‘radix_tree_node_rcu_free’ [-Werror=missing-prototypes]

Networking Devices

v5: iproute2-next: ip-link: add support for nolocalbypass in vxlan

Add userspace support for the [no]localbypass vxlan netlink attribute. With localbypass on (default), the vxlan driver processes the packets destined to the local machine by itself, bypassing the userspace network stack. With nolocalbypass, the packets are always forwarded to the userspace network stack, so userspace programs such as tcpdump have a chance to process them.

v1: net-next: nfc: Switch i2c drivers back to use .probe()

After commit b8a1a4cd5a98 (“i2c: Provide a temporary .probe_new() call-back type”), all drivers were converted to .probe_new(). Convert them back to (the new) .probe() so that .probe_new() can eventually be dropped from struct i2c_driver.

v1: net-next: net: phylink: require supported_interfaces to be filled

We have been requiring the supported_interfaces bitmap to be filled in by MAC drivers that have a mac_select_pcs() method. Now that all MAC drivers fill in the supported_interfaces bitmap, it is time to enforce this. We have already required supported_interfaces to be set in order for optical SFPs to be configured in commit f81fa96d8a6c (“net: phylink: use phy_interface_t bitmaps for optical modules”).

v1: net-next: net: sfp: add support for a couple of copper multi-rate modules

Add support for the Fiberstore SFP-10G-T and Walsun HXSX-ATRC-1 modules. Internally, the PCB silkscreen has what seems to be a part number of WT_502. Fiberstore use v2.2 whereas Walsun use v2.6.

v1: net: macb: use correct __be32 and __be16 types

This patch fixes the following sparse warnings. No functional changes.

Use cpu_to_be16() and cpu_to_be32() to convert constants before comparing them with __be16 type of psrc/pdst and __be32 type of ip4src/ip4dst. Apply be16_to_cpu() in GEM_BFINS().
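
The idea of converting the constant rather than the field can be sketched in userspace with `htons()`, which plays the role that `cpu_to_be16()` plays in the kernel. This is an illustrative sketch only; the helper names are hypothetical and none of this is macb driver code.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Compare a big-endian (network order) 16-bit field with a host-order
 * constant. The constant is converted once with htons(), the userspace
 * counterpart of cpu_to_be16(), so both sides use the same byte order.
 */
static int be16_field_equals(uint16_t be_field, uint16_t host_value)
{
	return be_field == htons(host_value);
}

/* Build a big-endian 16-bit value from two wire bytes, for demonstration. */
static uint16_t be16_from_wire(uint8_t hi, uint8_t lo)
{
	uint8_t bytes[2] = { hi, lo };
	uint16_t v;

	memcpy(&v, bytes, sizeof(v));
	return v;
}
```

The comparison gives the same answer on little-endian and big-endian hosts, which is the property sparse checks for when it tracks the `__be16` and `__be32` annotations.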

v7: virtio: pds_vdpa driver

This patchset implements a new module for the AMD/Pensando DSC that supports vDPA services on PDS Core VF devices. This code is based on and depends on include files from the pds_core driver described here[0]. The pds_core driver creates the auxiliary_bus devices that this module connects to, and this creates vdpa devices for use by the vdpa module.

v2: can: esd_usb: More preparation before supporting esd CAN-USB/3

Apply another small batch of patches as preparation for adding support of the newly available esd CAN-USB/3 to esd_usb.c.

v1: net-next: net/mlx5: Introduce SF direction

Whenever multiple Virtual Network functions (VNFs) are used by Service Function Chaining (SFC), each packet is passing through all the VNFs, and each VNF is performing hairpin in order to pass the packet to the next function in the chain.

v1: net: rtnetlink: not allow dev gro_max_size to exceed GRO_MAX_SIZE

In commit 0fe79f28bfaf (“net: allow gro_max_size to exceed 65536”), it limited GRO_MAX_SIZE to (8 * 65535) to avoid overflows, but also deleted the check of GRO_MAX_SIZE when setting the dev gro_max_size.
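
The kind of bound check being restored might look like the sketch below. Only the limit of 8 * 65535 comes from the text above; the helper name and its return convention are hypothetical and do not reproduce the rtnetlink code.

```c
#include <stdint.h>

/* Upper bound quoted in the commit message above: 8 * 65535. */
#define GRO_MAX_SIZE (8u * 65535u)

/*
 * Hypothetical validation helper: returns 0 when the requested value is
 * acceptable and -1 when it exceeds GRO_MAX_SIZE.
 */
static int check_gro_max_size(uint32_t requested)
{
	return requested > GRO_MAX_SIZE ? -1 : 0;
}
```

Values above 65536 remain allowed, as the earlier commit intended; only requests beyond the overflow-safe ceiling are rejected.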

v1: net-next: i40e: add PHY debug register dump

Implement ethtool register dump for some PHY registers in order to assist field debugging of link issues.

v1: net-next:pull request: ice: allow matching on meta data

This patchset is intended to improve the usability of the switchdev slow path. Without matching on metadata values, the slow path works based on the VF’s MAC addresses. This causes a problem when the VF wants to use more than one MAC address (e.g. when it is in trusted mode).

v2: net-next: net: dsa: mv88e6xxx: add 88E6361 support

This series brings initial support for Marvell 88E6361 switch.

MV88E6361 is an 8-port switch with 5 integrated Gigabit PHYs and 3 2.5Gigabit SerDes interfaces. It is in fact a new variant in the

  • port 0: MII, RMII, RGMII, 1000BaseX, 2500BaseX
  • port 3 to 7: triple speed internal phys
  • port 9 and 10: 1000BaseX, 2500BaseX

v1: net-next: TCP splice improvements

The main part is in Patch 1, which optimises locking for successful blocking TCP splice read, following with a clean up in Patch 2.

v1: net-next: net/tcp: refactor tcp_inet6_sk()

Don’t keep hand-coded offset calculations; replace them with container_of(). It should be more type safe and a bit less confusing.

It is also implemented as a macro instead of an inline function to preserve constness, which was previously cast away, as in the case of tcp_v6_send_synack().
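
For readers unfamiliar with the pattern, here is a minimal userspace sketch of container_of(). The two structures are hypothetical stand-ins, not the kernel’s TCP or IPv6 socket types, and this simplified macro omits the type checking that the kernel version performs.

```c
#include <stddef.h>

/* Userspace version of the familiar container_of() pattern. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical types standing in for an outer socket and an embedded part. */
struct inner_info {
	int hop_limit;
};

struct outer_sock {
	int id;
	struct inner_info info;
};

/* Recover the enclosing structure from a pointer to its embedded member. */
static struct outer_sock *outer_from_info(struct inner_info *info)
{
	return container_of(info, struct outer_sock, info);
}
```

Because the offset comes from `offsetof()` on a named member, a layout change is picked up by the compiler instead of silently invalidating a hand-written constant.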

v1: net-next: net: phy: add helpers for comparing phy IDs

There are several places which open code comparing PHY IDs. Provide a couple of helpers to assist with this, using a slightly simpler test than the original:

  • phy_id_compare() compares two arbitrary PHY IDs and a mask of the significant bits in the ID.
  • phydev_id_compare() compares the bound phydev with the specified PHY ID, using the bound driver’s mask.
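
A masked comparison of this kind can be written with a single XOR and AND. The sketch below is a standalone userspace illustration with a hypothetical function name; the example IDs and mask used to exercise it are made up for demonstration.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true when id1 and id2 agree on every bit selected by mask.
 * XOR leaves a 1 wherever the two IDs differ; the mask then discards
 * the bits that are not significant (for example a revision field).
 */
static bool id_compare(uint32_t id1, uint32_t id2, uint32_t mask)
{
	return !((id1 ^ id2) & mask);
}
```

With a mask of 0xfffffff0, two IDs that differ only in the low four bits compare as equal, which is how a driver can match a whole family of revisions.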

v4: net-next: Fine-Tune Flow Control and Speed Configurations in Microchip KSZ8xxx DSA Driver

change v4:

  • instead of downstream/upstream use CPU-port and PHY-port
  • adjust comments
  • minor fixes

v3: net: stmmac: compare p->des0 and p->des1 with __le32 type values

Use cpu_to_le32 to convert the constants to __le32 type before comparing them with p->des0 and p->des1 (which are of __le32 type), fixing the following sparse warnings:

drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c:110:23: sparse: warning: restricted __le32 degrades to integer
drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c:110:50: sparse: warning: restricted __le32 degrades to integer

v1: [net-next] net: ipconfig: move ic_nameservers_fallback into #ifdef block

The new variable is only used when IPCONFIG_BOOTP is defined and otherwise causes a warning:

net/ipv4/ipconfig.c:177:12: error: ‘ic_nameservers_fallback’ defined but not used [-Werror=unused-variable]

Move it next to the user.

v2: net-next: net: fec: turn on XDP features

The XDP features are supported since the commit 66c0e13ad236 (“drivers: net: turn on XDP features”). Currently, the fec driver supports NETDEV_XDP_ACT_BASIC, NETDEV_XDP_ACT_REDIRECT and NETDEV_XDP_ACT_NDO_XMIT. So turn on these XDP features for fec driver.

v1: net: stmmac: use le32_to_cpu for p->des0 and p->des1

Use le32_to_cpu for p->des0 and p->des1 to fix the following sparse warnings:

drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c:110:23: sparse: warning: restricted __le32 degrades to integer
drivers/net/ethernet/stmicro/stmmac/dwxgmac2_descs.c:110:50: sparse: warning: restricted __le32 degrades to integer

v13: io_uring: add napi busy polling support

This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id’s that are currently enabled for busy polling. This list is used to determine which napi id’s enabled busy polling. For faster access it also adds a hash table.

v6: Enable multiple MCAN on AM62x

On AM62x there are two MCANs in MCU domain. The MCANs in MCU domain were not enabled since there is no hardware interrupt routed to A53 GIC interrupt controller. Therefore A53 Linux cannot be interrupted by MCU MCANs.

v1: bpf-next: xsk: multi-buffer support

This series of patches adds multi-buffer support for AF_XDP. XDP and various NIC drivers already have support for multi-buffer packets. With this patch set, programs using AF_XDP sockets can now also receive and transmit multi-buffer packets in both copy and zero-copy mode. The ZC multi-buffer implementation is based on the ice driver.

v1: nf: netfilter: ipset: Add schedule point in call_ad().

syzkaller found a repro that causes Hung Task [0] with ipset. The repro first creates an ipset and then tries to delete a large number of IPs from the ipset concurrently:

IPSET_ATTR_IPADDR_IPV4: 172.20.20.187
IPSET_ATTR_CIDR: 2

v3: net: fec: add dma_wmb… (…l.org/netdev/20230518150202.1920375-1-shenwei.wang@nxp.com/)

Two dma_wmb() are added in the XDP TX path to ensure proper ordering of descriptor and buffer updates:

  1. A dma_wmb() is added after updating the last BD to make sure the updates to the rest of the descriptor are visible before transferring ownership to FEC.
  2. A dma_wmb() is also added after updating the bdp to ensure these updates are visible before updating txq->bd.cur.
  3. Start the xmit of the frame immediately right after configuring the tx descriptor.
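
The ordering rule behind the first two points (fill in the descriptor, issue a barrier, then hand over ownership) can be approximated in userspace with a C11 release fence. This is only an analogy under stated assumptions: the descriptor layout and names below are hypothetical, and a C11 fence orders CPU-visible memory, whereas dma_wmb() orders writes as seen by a DMA-capable device.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical descriptor layout; not the FEC hardware format. */
struct tx_desc {
	uint32_t buf_addr;
	uint16_t len;
	_Atomic uint16_t status;	/* READY bit hands the descriptor over */
};

#define DESC_READY 0x8000u

/*
 * Fill in the payload fields first, then issue a release fence, and only
 * then set the ownership bit. A consumer that observes DESC_READY (with
 * acquire ordering) is guaranteed to also observe buf_addr and len.
 */
static void publish_desc(struct tx_desc *d, uint32_t addr, uint16_t len)
{
	d->buf_addr = addr;
	d->len = len;
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&d->status, DESC_READY, memory_order_relaxed);
}
```

Without the fence, the store to `status` could become visible before the stores to `buf_addr` and `len`, which is the class of bug the added barriers guard against.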

v1: bpf: Use call_rcu_hurry() with synchronize_rcu_mult()

The bpf_struct_ops_map_free() function must wait for both an RCU grace period and an RCU Tasks grace period, and so it passes call_rcu() and call_rcu_tasks() to synchronize_rcu_mult(). This works, but on ChromeOS and Android platforms call_rcu() can have lazy semantics, resulting in multi-second delays between call_rcu() invocation and invocation of the corresponding callback.

GIT PULL: Networking for 6.4-rc3

The following changes since commit 6e27831b91a0bc572902eb065b374991c1ef452a:

Merge tag ‘net-6.4-rc2’ of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2023-05-11 08:42:47 -0500)

Security Hardening

v1: Memory Mapping (VMA) protection using PKU - set 1

We’re using PKU for in-process isolation to enforce control-flow integrity for a JIT compiler. In our threat model, an attacker exploits a vulnerability and has arbitrary read/write access to the whole process space concurrently to other threads being executed. This attacker can manipulate some arguments to syscalls from some threads.

v1: next: ALSA: mixart: Replace one-element arrays with simple object declarations

One-element arrays are deprecated, and we are replacing them with flexible array members, instead. However, in this case it seems those one-element arrays have never actually been used as fake flexible arrays.

v1: md/raid5: Convert stripe_head’s “dev” to flexible array member

Replace old-style 1-element array of “dev” in struct stripe_head with modern C99 flexible array. In the future, we can additionally annotate it with the run-time size, found in the “disks” member.
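
The difference between the two declarations shows up in `sizeof` and in the allocation arithmetic. The structures below are simplified, hypothetical stand-ins (the real “dev” member is a struct, not an int), meant only to sketch why the conversion removes an off-by-one adjustment.

```c
#include <stddef.h>

/* Old style: a one-element array used as a fake flexible array. */
struct stripe_old {
	int disks;
	int dev[1];
};

/* C99 style: a true flexible array member that adds no storage of its own. */
struct stripe_new {
	int disks;
	int dev[];
};

/* Allocation size for n elements with the old layout (note the n - 1). */
static size_t old_alloc_size(size_t n)
{
	return sizeof(struct stripe_old) + (n - 1) * sizeof(int);
}

/* Allocation size for n elements with the flexible array layout. */
static size_t new_alloc_size(size_t n)
{
	return sizeof(struct stripe_new) + n * sizeof(int);
}
```

Both helpers yield the same byte count, but the flexible array version no longer hides one element inside `sizeof`, which is also what lets the compiler reason about the array bounds.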

v1: kbuild: Enable -fstrict-flex-arrays=3

The -fstrict-flex-arrays=3 option is now available with the release of GCC 13[1] and Clang 16[2]. This feature instructs the compiler to treat only C99 flexible arrays as dynamically sized for the purposes of object size calculations. In other words, the ancient practice of using 1-element arrays, or the GNU extension of using 0-sized arrays, as a dynamically sized array is disabled. This allows CONFIG_UBSAN_BOUNDS, CONFIG_FORTIFY_SOURCE, and other object-size aware features to behave unambiguously in the face of trailing arrays: only C99 flexible arrays are considered to be dynamically sized.

v1: pid: Replace struct pid 1-element array with flex-array

For pid namespaces, struct pid uses a dynamically sized array member, “numbers”. This was implemented using the ancient 1-element fake flexible array, which has been deprecated for decades. Replace it with a C99 flexible array, refactor the array size calculations to use struct_size(), and address elements via indexes. Note that the static initializer (which defines a single element) works as-is, and requires no special handling.

v1: next: scsi: lpfc: Use struct_size() helper

Prefer struct_size() over open-coded versions of idiom:

sizeof(struct-with-flex-array) + sizeof(typeof-flex-array-elements) * count

where count is the max number of items the flexible array is supposed to contain.
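
The benefit of a helper over the open-coded idiom is overflow handling. The following userspace sketch mimics that behavior for one hypothetical structure using the GCC/Clang `__builtin_*_overflow` helpers; it is not the kernel macro, and the names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure with a flexible array member. */
struct item_list {
	size_t count;
	uint32_t items[];
};

/*
 * Overflow-checked "header + count * element" size, in the spirit of the
 * kernel's struct_size(). Saturates to SIZE_MAX so that a following
 * allocation fails instead of returning a too-small buffer.
 */
static size_t item_list_size(size_t count)
{
	size_t bytes;

	if (__builtin_mul_overflow(count, sizeof(uint32_t), &bytes))
		return SIZE_MAX;
	if (__builtin_add_overflow(bytes, sizeof(struct item_list), &bytes))
		return SIZE_MAX;
	return bytes;
}
```

With the open-coded expression, a huge `count` would wrap around and produce a small allocation; the saturating version turns that into an allocation failure instead.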

v1: next: scsi: lpfc: Replace one-element array with flexible-array member

One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element arrays with flexible-array members in a couple of structures, and refactor the rest of the code, accordingly.

v1: checkpatch: Check for strcpy and strncpy too

Warn about strcpy(), strncpy(), and strlcpy(). Suggest strscpy() and include pointers to the open KSPP issues for each, which has further details and replacement procedures.
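
To show why a strscpy()-style interface is preferred, here is a small userspace helper modelled on its semantics: bounded, always NUL-terminated, and explicit about truncation. The function name is hypothetical, and it returns -1 on truncation where the kernel function returns -E2BIG.

```c
#include <stddef.h>

/*
 * Hypothetical userspace helper modelled on strscpy() semantics:
 * copies at most size - 1 bytes, always NUL-terminates when size > 0,
 * and returns the copied length, or -1 if the source was truncated.
 */
static long bounded_copy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -1;
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';
	return src[i] == '\0' ? (long)i : -1;
}
```

Unlike strcpy(), the copy cannot run past the destination, and unlike strncpy(), the result is always terminated and the caller can detect truncation from the return value.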

v2: Compiler Attributes: Add __counted_by macro

In an effort to annotate all flexible array members with their run-time size information, the “element_count” attribute is being introduced by Clang[1] and GCC[2] in future releases. This annotation will provide the CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE features the ability to perform run-time bounds checking on otherwise unknown-size flexible arrays.

v1: next: media: venus: hfi_cmds: Replace fake flex-arrays with flexible-array members

One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element arrays with flexible-array members in multiple structures.

v1: next: media: venus: hfi_cmds: Replace fake flex-array with flexible-array member

One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element arrays with flexible-array members in struct hfi_sys_set_resource_pkt, and refactor the rest of the code, accordingly.

v1: next: media: venus: hfi_cmds: Use struct_size() helper

Prefer struct_size() over open-coded versions of idiom:

sizeof(struct-with-flex-array) + sizeof(typeof-flex-array-elements) * count

where count is the max number of items the flexible array is supposed to contain.

v1: next: media: venus: hfi_cmds: Replace one-element array with flexible-array member

One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element arrays with flexible-array members in struct hfi_session_set_buffers_pkt, and refactor the rest of the code, accordingly.

v1: next: media: venus: Replace one-element arrays with flexible-array members

One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element arrays with flexible-array members in multiple structures, and refactor the rest of the code, accordingly.

v1: next: iavf: Replace one-element array with flexible-array member

One-element arrays are deprecated, and we are replacing them with flexible array members instead. So, replace one-element array with flexible-array member in struct iavf_qvlist_info, and refactor the rest of the code, accordingly.

v1: next: wifi: wil6210: fw: Replace zero-length arrays with DECLARE_FLEX_ARRAY() helper

Zero-length arrays are deprecated, and we are moving towards adopting C99 flexible-array members instead. So, replace zero-length array declarations that stand alone in structs with the new DECLARE_FLEX_ARRAY() helper macro.

v1: next: wifi: wil6210: wmi: Replace zero-length array with DECLARE_FLEX_ARRAY() helper

Zero-length arrays are deprecated, and we are moving towards adopting C99 flexible-array members instead. So, replace zero-length array declarations that stand alone in structs with the new DECLARE_FLEX_ARRAY() helper macro.

v1: next: net: libwx: Replace zero-length array with flexible-array member

Zero-length arrays as fake flexible arrays are deprecated, and we are moving towards adopting C99 flexible-array members instead.

v1: next: mlxfw: Replace zero-length array with DECLARE_FLEX_ARRAY() helper

Zero-length arrays are deprecated, and we are moving towards adopting C99 flexible-array members instead. So, replace zero-length array declarations that stand alone in structs with the new DECLARE_FLEX_ARRAY() helper macro.

Asynchronous IO

v1: net-next: minor tcp io_uring zc optimisations

Patch 1 is a simple cleanup; patch 2 removes 2 atomics from the io_uring zc TCP submission path, which yielded an extra 0.5% in my CPU-bound throughput tests based on liburing/examples/send-zerocopy.c.

v1: for-next: Enable IOU_F_TWQ_LAZY_WAKE for passthrough

Let cmds use IOU_F_TWQ_LAZY_WAKE and enable it for nvme passthrough.

The result should be the same as in the tests for the original IOU_F_TWQ_LAZY_WAKE [1] patchset, but for a quick test I took fio/t/io_uring with 4 threads, each reading its own drive and all pinned to the same CPU to make it CPU bound, and got a +10% throughput improvement.

Rust For Linux

v1: Bindings for the workqueue

This patchset contains bindings for the kernel workqueue.

One of the primary goals behind the design used in this patch is that we must support embedding the work_struct as a field in user-provided types, because this allows you to submit things to the workqueue without having to allocate, making the submission infallible. If we didn’t have to support this, then the patch would be much simpler. One of the main things that make it complicated is that we must ensure that the function pointer in the work_struct is compatible with the struct it is contained within.

v1: rust: networking and crypto abstractions

This includes initial rust abstractions for networking and crypto.

I’ve been working on an in-kernel TLS 1.3 handshake in Rust on top of this. Currently you can run simple TLS server code, which does a handshake and sets up kTLS (Kernel TLS offload) to read and write some bytes.

BPF

v9: bpf-next: bpf: Add socket destroy capability

This patch set adds the capability to destroy sockets in BPF. We plan to use the capability in Cilium to force client sockets to reconnect when their remote load-balancing backends are deleted. The other use case is on-the-fly policy enforcement where existing socket connections prevented by policies need to be terminated.

v1: dwarves: Encoding function addresses using DECL_TAGs

As a means to continue the discussion in [1], which is concerned with finding the best long-term solution to having a BPF Type Format (BTF) representation of functions that is usable for tracing of edge cases, this proof-of-concept series is intended to explore one approach to adding information to help make tracing more accurate.

v2: bpf-next: bpftool: specify XDP Hints ifname when loading program

Add the ability to specify the network interface used to resolve XDP Hints kfuncs when loading a program through bpftool.

v1: bpf-next: selftests/bpf: add xdp_feature selftest for bond device

Introduce selftests to check xdp_feature support for the bond driver.

v2: bpf-next: bpf: Show target_{obj,btf}_id for tracing link

The target_btf_id can help us understand which kernel function is linked by a tracing prog. The target_btf_id and target_obj_id have already been exposed to userspace, so we just need to show them.

v1: selftests/bpf: Do not use sign-file as testcase

The sign-file utility (from scripts/) is used in prog_tests/verify_pkcs7_sig.c, but the utility should not be called as a test. Executing this utility produces the following error:

v1: support non-frag page for page_pool_alloc_frag()

In [1], there is a use case for frag support in page pool to reduce memory usage, and it may request a different frag size depending on the head/tail room space for xdp_frame/shinfo and the mtu/packet size. When the requested frag size is so large that a single page cannot be split into more than one frag, using frag support only incurs a performance penalty because of the extra frag count handling.
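The condition described above can be sketched as a simple size check. This is an illustration under stated assumptions, not the page_pool implementation: the 4096-byte page size and the `page_can_be_split()` helper are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size, for illustration only */

/*
 * Returns true when at least two frags of the requested size fit in one page.
 * When this is false, frag accounting buys nothing, and a caller could hand
 * out a whole (non-frag) page instead.
 */
static bool page_can_be_split(size_t size)
{
    assert(size > 0);
    return size <= SKETCH_PAGE_SIZE / 2;
}
```

With these assumptions, a 2048-byte request can still share a page, while a 2049-byte request cannot.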

v2: bpf-next: seltests/xsk: prepare for AF_XDP multi-buffer testing

Prepare the AF_XDP selftests test framework code for the upcoming multi-buffer support in AF_XDP. This is so that the multi-buffer patch set does not become way too large. In that upcoming patch set, we are only including the multi-buffer tests together with any framework code that depends on the new options bit introduced in the AF_XDP multi-buffer implementation itself.

v1: bpf-next: selftests/bpf: improve netcnt test robustness

Change netcnt to demand at least 10K packets, as we frequently see some stray packets arriving during the test in BPF CI. It seems more important to make sure we haven’t lost any packets than to enforce an exact number of packets.

v1: bpf: samples/bpf: use canonical fallthrough pseudo-keyword in hbm.c

Rename the now unsupported __fallthrough to fallthrough ([0]) in samples/bpf/hbm.c to fix samples/bpf compilation.

[0] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
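For readers unfamiliar with the pseudo-keyword, here is a small userspace sketch. The kernel defines `fallthrough` as `__attribute__((__fallthrough__))` where the compiler supports it; the macro is redefined locally below so the example compiles outside the kernel tree. The `privilege_count()` function is hypothetical and unrelated to hbm.c.

```c
#include <assert.h>

/* Local userspace definition mirroring the kernel's pseudo-keyword. */
#if defined(__has_attribute)
# if __has_attribute(__fallthrough__)
#  define fallthrough __attribute__((__fallthrough__))
# endif
#endif
#ifndef fallthrough
# define fallthrough do {} while (0)  /* fallthrough */
#endif

/* Hypothetical example: each level also grants everything below it. */
static int privilege_count(int level)
{
    int n = 0;

    assert(level >= 0);

    switch (level) {
    case 2:
        n++;
        fallthrough;    /* deliberate: level 2 implies level 1 */
    case 1:
        n++;
        fallthrough;    /* deliberate: level 1 implies level 0 */
    case 0:
        n++;
        break;
    default:
        break;
    }
    return n;
}
```

Marking each intended fall-through this way lets the compiler warn about the unintended ones.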

v2: iwl-net: ice: recycle/free all of the fragments from multi-buffer frame

The ice driver caches the next_to_clean value at the beginning of ice_clean_rx_irq() in order to remember the first buffer that has to be freed/recycled after the main Rx processing loop. The end boundary is indicated by the first descriptor of the frame on which the Rx processing loop ended its duties. Note that if the loop ended in the middle of gathering a multi-buffer frame, next_to_clean would point to a descriptor in the middle of the frame, BUT the freeing/recycling stage would stop at the first descriptor. This means that the next iteration of ice_clean_rx_irq() will miss the (first_desc, next_to_clean - 1) entries.

v2: bpf-next: bpf: bpf trampoline improvements

When we ran fexit bpf programs (e.g. attaching to tcp_recvmsg) on our servers, which were running old kernels, some of these servers crashed. Finally we figured out that it was caused by the same issue resolved by commit e21aa341785c (“bpf: Fix fexit trampoline.”). After we backported that commit, the crash disappeared. However, new issues were introduced by that commit. This patchset fixes them.

v1: bpf-next: bpf: btf: restore resolve_mode when popping the resolve stack

In commit 9b459804ff99 (“btf: fix resolving BTF_KIND_VAR after ARRAY, STRUCT, UNION, PTR”) I fixed a bug that occurred during resolving of a DATASEC by strategically resetting resolve_mode. This fixes the immediate bug but leaves us open to future bugs where nested types have to be resolved.

v1: Make fprobe + rethook immune to recursion

The current fprobe and rethook have some pitfalls and may introduce kernel stack recursion, especially in massive tracing scenarios.

For example, if (DEBUG_PREEMPT | TRACE_PREEMPT_TOGGLE) is enabled, preempt_count_{add, sub} can be traced via ftrace. If we happen to use fprobe + rethook based on ftrace to hook those functions, recursion is introduced in functions like rethook_trampoline_handler and leads to a kernel crash because of stack overflow.
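The failure mode, a handler that calls a function which is itself hooked, can be illustrated with a generic re-entrancy guard. This is a userspace sketch of one common technique and not necessarily what the patchset does; `probe_handler()`, `traced_helper()` and the counters are hypothetical, and real kernel code would keep the flag per CPU or per task.

```c
#include <assert.h>
#include <stdbool.h>

static bool in_handler;     /* guard flag; per-CPU or per-task in real code */
static int handled;         /* invocations that did real work */
static int skipped;         /* nested invocations rejected by the guard */

static void traced_helper(void);

/* Hypothetical probe handler that calls a helper which is itself hooked. */
static void probe_handler(void)
{
    if (in_handler) {
        /* Already inside the handler: bail out instead of recursing. */
        skipped++;
        return;
    }

    in_handler = true;
    handled++;
    traced_helper();
    assert(in_handler);     /* the nested call must not clear the guard */
    in_handler = false;
}

/* Stand-in for a traced function such as a preempt-count helper. */
static void traced_helper(void)
{
    probe_handler();        /* simulates the hook firing inside the handler */
}
```

Without the guard, `probe_handler()` and `traced_helper()` would call each other until the stack overflows, which mirrors the crash described above.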

Related Technology Updates

QEMU

v1: hw/riscv/opentitan: Correct QOM type/size of OpenTitanState

This series fixes a QOM issue with the OpenTitanState structure, noticed while auditing QOM relations globally.

v5: hw/riscv: qemu crash when NUMA nodes exceed available CPUs

Command “qemu-system-riscv64 -machine virt -m 2G -smp 1 -numa node,mem=1G -numa node,mem=1G” would trigger this problem. Backtrace:

#0 0x0000555555b5b1a4 in riscv_numa_get_default_cpu_node_id at ../hw/riscv/numa.c:211
#1 0x00005555558ce510 in machine_numa_finish_cpu_init at ../hw/core/machine.c:1230
#2 0x00005555558ce9d3 in machine_run_board_init at ../hw/core/machine.c:1346
#3 0x0000555555aaedc3 in qemu_init_board at ../softmmu/vl.c:2513
#4 0x0000555555aaf064 in qmp_x_exit_preconfig at ../softmmu/vl.c:2609
#5 0x0000555555ab1916 in qemu_init at ../softmmu/vl.c:3617
#6 0x000055555585463b in main at ../softmmu/main.c:47

This commit fixes the issue by adding parameter checks.

v1: Add RISC-V Virtual IRQs and IRQ filtering support

This series adds M-mode and HS-mode virtual interrupt and IRQ filtering support. This allows inserting virtual interrupts from M/HS-mode into S/VS-mode using the mvien/hvien and mvip/hvip CSRs. IRQ filtering is a use case of this change, i.e., M-mode can stop delegating an interrupt to S-mode, enable it in MIE instead, receive those interrupts in M-mode, and then selectively inject the interrupt using mvien and mvip.

v9: target/riscv: rework CPU extension validation

In this version we have a change in patch 11. We’re now firing a GUEST_ERROR if write_misa() fails and we need to roll back (i.e. not change MISA ext).

U-Boot

v2: riscv: setup per-hart stack earlier

Harts need to use a per-hart stack before any function call, even if that function is a simple one. When the callee uses the stack to save and restore registers, especially RA, and calls are nested, concurrent access by multiple harts to the same stack will cause a data race.
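One way to picture a per-hart stack is as a fixed-size slice carved from a shared region and indexed by hart ID. The sketch below is only an illustration of that address arithmetic, not U-Boot's start-up code: `hart_stack_top()` and the 16 KiB slice size are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_STACK_SIZE_SHIFT 14u  /* assumed: 16 KiB of stack per hart */

/*
 * Each hart gets its own slice below the shared top-of-stack address, so
 * register saves by one hart (for example RA) never overlap another's.
 */
static uintptr_t hart_stack_top(uintptr_t stack_top, unsigned int hartid)
{
    uintptr_t offset = (uintptr_t)hartid << SKETCH_STACK_SIZE_SHIFT;

    assert(offset <= stack_top);   /* do not wrap below address zero */
    return stack_top - offset;
}
```

In real start-up code this computation has to happen in assembly before the first call, since the call itself already needs a private stack.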

v1: riscv: add backtrace support

When debugging, it is useful to have a backtrace to find out what is in the call stack as the previous function (RA) may not have been the culprit.
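A backtrace of this kind is typically produced by walking saved frame pointers. The sketch below assumes the conventional RISC-V layout used when frame pointers are kept (the saved RA sits one word below the frame pointer and the caller's frame pointer two words below it). `walk_frames()` is a hypothetical helper that operates on a caller-supplied memory range; it is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk a chain of frames starting at fp and store each saved return address
 * in out[]. lo and hi bound the stack memory that may be read. Returns the
 * number of return addresses collected (at most max).
 */
static size_t walk_frames(uintptr_t fp, uintptr_t lo, uintptr_t hi,
                          uintptr_t *out, size_t max)
{
    size_t n = 0;

    assert(out != NULL && lo <= hi);

    while (n < max &&
           fp >= lo + 2 * sizeof(uintptr_t) && fp <= hi &&
           fp % sizeof(uintptr_t) == 0) {
        const uintptr_t *slot = (const uintptr_t *)fp;
        uintptr_t ra = slot[-1];        /* saved return address */
        uintptr_t prev_fp = slot[-2];   /* caller's frame pointer */

        out[n++] = ra;

        /* The stack grows down, so a valid caller frame is at a higher address. */
        if (prev_fp <= fp)
            break;
        fp = prev_fp;
    }
    return n;
}
```

The bounds and alignment checks matter in practice: a backtrace usually runs after something has already gone wrong, so the frame chain cannot be trusted.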


