泰晓科技 (TinyLab) -- Focused on Linux: trace things to their source, see the whole from the details!
Website: https://tinylab.org

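RISC-V Linux Kernel and Related Technology News, Issue 50

Written by 呀呀呀 on 2023/06/20

Date: 20230618
Editor: 晓依
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

File Systems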

v1: d_path: include internal.h

Include internal.h to get the definition of simple_dname, to fix the following sparse warning:

fs/d_path.c:317:6: warning: symbol ‘simple_dname’ was not declared. Should it be static?

v1: fs: Provide helpers for manipulating sb->s_readonly_remount

Provide helpers to set and clear sb->s_readonly_remount including appropriate memory barriers. Also use this opportunity to document what the barriers pair with and why they are needed.

v5: dax: enable dax fault handler to report VM_FAULT_HWPOISON

Change from v4: Add comments describing when and why dax_mem2blk_err() is used. Suggested by Dan.

v19: Implement IOCTL to get and optionally clear info about PTEs

At this point, we set soft-dirty aside, considering it too delicate, and userfaultfd [9] seemed like the only way forward. From there onward, we have been basing soft-dirty emulation on the userfaultfd WP feature, where the kernel resolves the faults itself when the WP_ASYNC feature is used. It was straightforward to add the WP_ASYNC feature to userfaultfd. Now only those pages that have really been written are reported as dirty or written-to. (PS: another userfaultfd feature, WP_UNPOPULATED, is also required to avoid pre-faulting memory before write-protecting [9].)

v1: fs: Protect reconfiguration of sb read-write from racing writes

The reconfigure / remount code takes a lot of effort to protect the filesystem’s reconfiguration code from racing writes when remounting read-only. However, when remounting a read-only filesystem to read-write mode, userspace writes can start immediately once we clear the SB_RDONLY flag. This is inconvenient, for example, for ext4, because we need to do some writes to the filesystem (such as preparation of quota files) before we can take userspace writes, so we are clearing the SB_RDONLY flag before we are fully ready to accept userspace writes, and syzbot has found a way to exploit this [1]. Also, as far as I’m reading the code, the filesystem remount code was protected from racing writes in the legacy mount path by the mount’s MNT_READONLY flag, so this is a relatively new problem. It is actually fairly easy to protect remount read-write from racing writes using the sb->s_readonly_remount flag, so let’s just do that instead of having to work around these races in the filesystem code.

v5: Handle notifications on overlayfs fake path files

A little while ago, Jan and I realized that an unprivileged overlayfs mount could be used to avert fanotify permission events that were requested for an inode or sb on the underlying fs.

v1: exfat: get file size from DataLength

According to the exFAT specification, the file size should be taken from ‘DataLength’ of the Stream Extension Directory Entry, not ‘ValidDataLength’.

v3: eventfd: add a uapi header for eventfd userspace APIs

Create a uapi header include/uapi/linux/eventfd.h, move the associated flags to the uapi header, and include it from linux/eventfd.h.

v1: fs: use helpers for opening kernel internal files

Overlayfs and cachefiles use vfs_open_tmpfile() to open a tmpfile without accounting for nr_files.

Rename this helper to kernel_tmpfile_open() to better reflect this helper is used for kernel internal users.

v1: RFC: high-order folio support for I/O

Now, that was easy. Thanks to willy and his recent patchset to support large folios in gfs2, it turns out that most of the work to support high-order folios for I/O is actually done. It only needs two rather obvious patches: one to allocate folios with the order derived from the mapping blocksize, and one to adjust readahead to avoid reading off the end of the device.

v3: Add support for Vendor Defined Error Types in Einj Module

This patchset adds support for Vendor Defined Error types in the einj module by exporting a binary blob file in module’s debugfs directory. Userspace tools can write OEM Defined Structures into the blob file as part of injecting Vendor defined errors.

v1: Report on physically contiguous memory in smaps

This series adds new entries to /proc/pid/smaps[_rollup] to report on physically contiguous runs of memory. The first patch reports on the sizes of the runs by binning into power-of-2 blocks and reporting how much memory is in which bin. The second patch reports on how much of the memory is contpte-mapped in the page table (this is a hint that arm64 supports to tell the HW that a range of ptes map physically contiguous memory).
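One possible reading of the binning described above can be sketched as follows. The run lengths, the 4 KiB page size, and the function name are assumptions made for illustration; the real smaps field names and the exact binning rule are defined by the patch series, not by this sketch.

```python
from collections import defaultdict

PAGE_SIZE_KIB = 4  # assumed base page size, for illustration only


def bin_contiguous_runs(run_lengths_pages):
    """Group runs by the largest power of two not exceeding their length.

    Returns {bin_size_in_pages: total_kib_in_that_bin}.
    """
    bins = defaultdict(int)
    for pages in run_lengths_pages:
        if pages <= 0:
            continue  # ignore empty runs
        bin_pages = 1 << (pages.bit_length() - 1)  # round down to a power of 2
        bins[bin_pages] += pages * PAGE_SIZE_KIB
    return dict(bins)


# Hypothetical physically contiguous runs, measured in pages.
runs = [1, 1, 3, 4, 16, 17, 512]
for size, kib in sorted(bin_contiguous_runs(runs).items()):
    print(f"{size * PAGE_SIZE_KIB:>5} KiB bin: {kib} KiB mapped")
```

Rounding each run down to a power of two keeps the number of bins small while still showing whether memory is dominated by short or long contiguous runs.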

v1: errseq_t: split the ERRSEQ_SEEN flag into two

NFS wants to use the errseq_t mechanism to detect errors that occur during a write, but for that use-case we want to ignore anything that happened before the sample point.

v3: gfs2/buffer folio changes for 6.5

This kind of started off as a gfs2 patch series, then became entwined with buffer heads once I realised that gfs2 was the only remaining caller of __block_write_full_page(). For those not in the gfs2 world, the big point of this series is that block_write_full_page() should now handle large folios correctly.

v3: Create large folios in iomap buffered write path

The problem ends up being lock contention on the i_pages spinlock as we clear the writeback bit on each folio (and propagate that up through the tree). By using larger folios, we decrease the number of folios to be processed by a factor of 256 for this benchmark, eliminating the lock contention.

v2: Landlock support for UML

Commit cb2c7d1a1776 (“landlock: Support filesystem access-control”) introduced a new ARCH_EPHEMERAL_INODES configuration, only enabled for User-Mode Linux. The reason was that UML’s hostfs managed inodes in an ephemeral way: from the kernel point of view, the same inode struct could be created several times while being used by user space because the kernel didn’t hold references to inodes. Because Landlock (and probably other subsystems) ties properties (i.e. access rights) to inode objects, it wasn’t possible to create rules that match inodes and then allow specific accesses.

v1: block: Add config option to not allow writing to mounted devices

Writing to mounted devices is dangerous and can lead to filesystem corruption as well as crashes. Furthermore syzbot comes with more and more involved examples how to corrupt block device under a mounted filesystem leading to kernel crashes and reports we can do nothing about. Add config option to disallow writing to mounted (exclusively open) block devices. Syzbot can use this option to avoid uninteresting crashes. Also users whose userspace setup does not need writing to mounted block devices can set this config option for hardening.

v5: blksnap - block devices snapshots module

I am happy to offer an improved version of the Block Devices Snapshots Module. It allows creating non-persistent snapshots of any block device. The main purpose of such snapshots is to provide backups of block devices. See more in Documentation/block/blksnap.rst.

v1: zonefs: set FMODE_CAN_ODIRECT instead of a dummy direct_IO method

Since commit a2ad63daa88b (“VFS: add FMODE_CAN_ODIRECT file flag”) file systems can just set the FMODE_CAN_ODIRECT flag at open time instead of wiring up a dummy direct_IO method to indicate support for direct I/O. Do that for zonefs so that noop_direct_IO can eventually be removed.

v1: fs: kernel and userspace filesystem freeze

Sometimes, kernel filesystem drivers need the ability to quiesce writes to the filesystem so that the driver can perform some kind of maintenance activity. This capability mostly already exists in the form of filesystem freezing but with the huge caveat that userspace can thaw any frozen fs at any time. If the correctness of the fs maintenance program requires stillness of the filesystem, then this caveat is BAD.

v1: nilfs2: prevent general protection fault in nilfs_clear_dirty_page()

In a syzbot stress test that deliberately causes file system errors on nilfs2 with a corrupted disk image, it has been reported that nilfs_clear_dirty_page() called from nilfs_clear_dirty_pages() can cause a general protection fault.

v1: eventfd: show flags in fdinfo

The flags should be displayed in fdinfo, as different flags could affect the behavior of eventfd.

v1: fsnotify: move fsnotify_open() hook into do_dentry_open()

fsnotify_open() hook is called only from high level system calls context and not called for the very many helpers to open files.

v1: sysctl: set variable sysctl_mount_point storage-class-specifier to static

smatch reports fs/proc/proc_sysctl.c:32:18: warning: symbol ‘sysctl_mount_point’ was not declared. Should it be static?

This variable is only used in its defining file, so it should be static.

v1: blk: optimization for classic polling

This removes the dependency on interrupts to wake up task. Set task state as TASK_RUNNING, if need_resched() returns true, while polling for IO completion. Earlier, polling task used to sleep, relying on interrupt to wake it up. This made some IO take very long when interrupt-coalescing is enabled in NVMe.

Networking Devices

v2: net: revert “net: align SO_RCVMARK required privileges with SO_MARK”

This reverts commit 1f86123b9749 (“net: align SO_RCVMARK required privileges with SO_MARK”) because the reasoning in the commit message is not really correct: SO_RCVMARK is used for ‘reading’ the incoming skb mark (via cmsg); as such, it is more equivalent to ‘getsockopt(SO_MARK)’, which has no priv check and retrieves the socket mark, than to ‘setsockopt(SO_MARK)’, which sets the socket mark and does require privs.

v1: net-next: netlabel: Reorder fields in ‘struct netlbl_domaddr6_map’

Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of ‘struct netlbl_domaddr6_map’ from 72 to 64 bytes.
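The effect of grouping fields by size can be reproduced with a small ctypes sketch. The two structures below are hypothetical and are not the real layout of ‘struct netlbl_domaddr6_map’; they only show how interleaving small and large fields creates holes that reordering removes.

```python
import ctypes


class Interleaved(ctypes.Structure):
    # Small and large fields alternate, so natural alignment forces
    # padding after each 1-byte field.
    _fields_ = [
        ("flag_a", ctypes.c_uint8),
        ("ptr_a", ctypes.c_uint64),
        ("flag_b", ctypes.c_uint8),
        ("ptr_b", ctypes.c_uint64),
    ]


class Grouped(ctypes.Structure):
    # Same fields, largest first: the two 1-byte fields now share
    # the tail of the structure.
    _fields_ = [
        ("ptr_a", ctypes.c_uint64),
        ("ptr_b", ctypes.c_uint64),
        ("flag_a", ctypes.c_uint8),
        ("flag_b", ctypes.c_uint8),
    ]


print("interleaved:", ctypes.sizeof(Interleaved), "bytes")
print("grouped:    ", ctypes.sizeof(Grouped), "bytes")
```

For kernel structures the same information is usually obtained with the pahole tool, which reports holes and padding directly.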

It saves a few bytes of memory and is more cache-line friendly.

v1: net-next: mptcp: Reorder fields in ‘struct mptcp_pm_add_entry’

Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of ‘struct mptcp_pm_add_entry’ from 136 to 128 bytes.

v1: net-next: mctp: Reorder fields in ‘struct mctp_route’

Group some variables based on their sizes to reduce hole and avoid padding. On x86_64, this shrinks the size of ‘struct mctp_route’ from 72 to 64 bytes.

v1: net-next: dt-bindings: net: bluetooth: qualcomm: document VDD_CH1

WCN3990 comes with two chains - CH0 and CH1 - where each takes VDD regulator. It seems VDD_CH1 is optional (Linux driver does not care about it), so document it to fix dtbs_check warnings like:

sdm850-lenovo-yoga-c630.dtb: bluetooth: ‘vddch1-supply’ does not match any of the regexes: ‘pinctrl-[0-9]+’

v1: net-next: net: phy: at803x: Use devm_regulator_get_enable_optional()

Use devm_regulator_get_enable_optional() instead of hand-writing it. It saves a few lines of code.

v1: selftests: tc-testing: add one test for flushing explicitly created chain

Add a test for the additional reference to chains that are explicitly created by an RTM_NEWCHAIN message.

commit c9a82bec02c3 (“net/sched: cls_api: Fix lockup on flushing explicitly created chain”)

v3: net-next:pull request: Introduce Intel IDPF driver

This patch series introduces the Intel Infrastructure Data Path Function (IDPF) driver. It is used for both physical and virtual functions. Except for some of the device operations the rest of the functionality is the same for both PF and VF. IDPF uses virtchnl version2 opcodes and structures defined in the virtchnl2 header file which helps the driver to learn the capabilities and register offsets from the device Control Plane (CP) instead of assuming the default values.

v1: net: selftests/ptp: Add support for new timestamp IOCTLs

PTP_SYS_OFFSET_EXTENDED was added in November 2018, and PTP_SYS_OFFSET_PRECISE was added in February 2016 in 719f1aa4a671 (“ptp: Add PTP_SYS_OFFSET_PRECISE for driver crosstimestamping”).

v8: net-next: Brcm ASP 2.0 Ethernet Controller

Add support for the Broadcom ASP 2.0 Ethernet controller which is first introduced with 72165.

v1: net-next: net: dqs: add NIC stall detector based on BQL

softnet_data->time_squeeze is sometimes used as a proxy for host overload or indication of scheduling problems. In practice this statistic is very noisy and has hard to grasp units - e.g. is 10 squeezes a second to be expected, or high?

v2: net-next: gro: move the tc_ext comparison to a helper

The double ifdefs (one for the variable declaration and one around the code) are quite aesthetically displeasing. Factor this code out into a helper for easier wrapping.

v5: net: phy: Add sysfs attribute for PHY c45 identifiers.

If a PHY device uses c45, its phy_id property is always 0, so this adds a c45_ids sysfs attribute group to MDIO devices, containing MMD ID attributes from mmd0 to mmd31. Note that only MMDs with a valid value will exist. This attribute group can be useful when debugging problems related to PHY drivers.

v1: net-next: Add TJA1120 support

This patch series got bigger than I expected. It cleans up the nxp-c45-tja11xx driver and adds support for the TJA1120 (1000BaseT1 automotive PHY).

Master/slave custom implementation was replaced with the generic implementation (genphy_c45_config_aneg/genphy_c45_read_status).

v1: nfc: fdp: Add MODULE_FIRMWARE macros

The module loads firmware so add MODULE_FIRMWARE macros to provide that information via modinfo.

v1: ieee802154/adf7242: Add MODULE_FIRMWARE macro

The module loads firmware so add a MODULE_FIRMWARE macro to provide that information via modinfo.

v1: net-next: Add and use helper for PCS negotiation modes

Earlier this month, I proposed a helper for deciding whether a PCS should use inband negotiation modes or not. There was some discussion around this topic, and I believe there was no disagreement about providing the helper.

v1: net: dpaa2-mac: add 25gbase-r support

Layerscape MACs support 25Gbps network speed with dpmac “CAUI” mode. Add the mappings between DPMAC_ETH_IF_* and PHY_INTERFACE_MODE_*, as well as the 25000 mac capability.

v1: net-next: dt-bindings: net: phy: gpy2xx: more precise description

Mention that the interrupt line is just asserted for a random period of time, not the entire time.

v2: drivers:net:ethernet:Add missing fwnode_handle_put()

When breaking out of a device_for_each_child_node() loop, fwnode_handle_put() must be called explicitly, because the iterator only increases and decreases the reference count automatically while it advances from one child to the next.

v2: net: macsec SCI assignment for ES = 0

According to 802.1AE standard, when ES and SC flags in TCI are zero, used SCI should be the current active SC_RX. Current kernel does not implement it and uses the header MAC address.

v1: net-next: ipv6: also use netdev_hold() in ip6_route_check_nh()

In blamed commit, we missed the fact that ip6_validate_gw() could change dev under us from ip6_route_check_nh()

In this fix, I use GFP_ATOMIC in order to not pass too many additional arguments to ip6_validate_gw() and ip6_route_check_nh() only for a rarely used debug feature.

Security Hardening

v3: Randomized slab caches for kmalloc()

I adapted the v2 patch to the latest linux-next tree and made the v3 patch without “RFC”, since this idea seems to be acceptable in general based on previous discussion with mm and hardening folks. Please check the link specified below for more details of the discussion; further suggestions are welcome.

v3: usbip: usbip_host: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().
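The difference between the two functions can be modelled in a few lines. This is a simplified model of the documented semantics, not the kernel implementation: buffers are Python byte objects, and the function names ending in `_model` are made up for this sketch.

```python
import errno


def strlcpy_model(dest: bytearray, src: bytes) -> int:
    """Model of strlcpy(): returns the full length of src.

    Producing that return value means scanning src up to its NUL,
    no matter how small dest is.
    """
    src_len = src.index(0)  # scans the whole source; fails if unterminated
    if len(dest) > 0:
        n = min(src_len, len(dest) - 1)
        dest[:n] = src[:n]
        dest[n] = 0
    return src_len


def strscpy_model(dest: bytearray, src: bytes) -> int:
    """Model of strscpy(): reads at most len(dest) bytes of src.

    Returns the number of characters copied, or -E2BIG on truncation.
    """
    size = len(dest)
    if size == 0:
        return -errno.E2BIG
    for i in range(size):
        byte = src[i]
        dest[i] = byte
        if byte == 0:
            return i
    dest[size - 1] = 0  # force NUL-termination of the truncated copy
    return -errno.E2BIG


buf = bytearray(6)
print(strlcpy_model(buf, b"hello world\0"), bytes(buf))
buf = bytearray(6)
print(strscpy_model(buf, b"hello world\0"), bytes(buf))
```

Callers that relied on the strlcpy() return value to detect truncation have to be adapted to test for a negative return from strscpy() instead, which is why some conversions in this series are not purely mechanical.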

v2: tracing/boot: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v2: usb: gadget: function: printer: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v2: pstore/platform: Add check for kstrdup

Add check for the return value of kstrdup() and return the error if it fails in order to avoid NULL pointer dereference.

v2: usb: ch9: Replace 1-element array with flexible array

Since commit df8fc4e934c1 (“kbuild: Enable -fstrict-flex-arrays=3”), UBSAN_BOUNDS no longer pretends 1-element arrays are unbounded. Walking wData will trigger a warning, so make it a proper flexible array. Add a union to keep the struct size identical for userspace in case anything was depending on the old size.

v3: wifi: cfg80211: replace strlcpy() with strscpy()

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v2: wifi: cfg80211: replace strlcpy() with strlscpy()

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v3: SUNRPC: Use sysfs_emit in place of strlcpy/sprintf

Part of an effort to remove strlcpy() tree-wide [1].

Direct replacement is safe here since the getter in kernel_params_ops handles -errno return [2].

v1: pstore/ram: Add check for kstrdup

Add check for the return value of kstrdup() and return the error if it fails in order to avoid NULL pointer dereference.

v3: uml: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe.

v1: SUNRPC: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v1: net/mediatek: strlcpy withreturn

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v1: netfilter: ipset: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v1: mac80211: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v1: ieee802154: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

v1: cfg80211: cfg80211: strlcpy withreturn

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy().

Asynchronous I/O

v2: add initial io_uring_cmd support for sockets

This patchset creates the initial plumbing for a io_uring command for sockets.

For now, create two uring commands for sockets, SOCKET_URING_OP_SIOCOUTQ and SOCKET_URING_OP_SIOCINQ, which are available in TCP, UDP and RAW sockets.
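For context, the two new commands mirror the existing SIOCINQ and SIOCOUTQ socket ioctls. The sketch below shows the classic ioctl path only, not the io_uring interface added by the patchset. It assumes Linux, where SIOCINQ and SIOCOUTQ share their values with FIONREAD and TIOCOUTQ, and it uses an AF_UNIX socket pair instead of TCP, UDP or RAW sockets purely to avoid any network setup.

```python
import fcntl
import socket
import struct
import termios

# On Linux, SIOCINQ and SIOCOUTQ are defined as FIONREAD and TIOCOUTQ.
SIOCINQ = termios.FIONREAD
SIOCOUTQ = termios.TIOCOUTQ


def queued_bytes(sock: socket.socket, request: int) -> int:
    """Issue an int-returning ioctl on a socket and decode the result."""
    raw = fcntl.ioctl(sock.fileno(), request, struct.pack("i", 0))
    return struct.unpack("i", raw)[0]


# Stand-in socket pair; the patchset itself targets TCP, UDP and RAW sockets.
left, right = socket.socketpair()
left.sendall(b"hello")
unread = queued_bytes(right, SIOCINQ)
print("unread bytes on the receiving side:", unread)
left.close()
right.close()
```

SIOCINQ reports data waiting in the receive queue, while SIOCOUTQ reports data still held in the send queue.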

v1: io_uring/net: save msghdr->msg_control for retries

If the application sets ->msg_control and we have to later retry this command, or if it got queued with IOSQE_ASYNC to begin with, then we need to retain the original msg_control value. This is due to the net stack overwriting this field with an in-kernel pointer, to copy it in. Hitting that path for the second time will now fail the copy from user, as it’s attempting to copy from a non-user address.

Rust For Linux

v2: scripts/rust_is_available.sh improvements

This is the patch series to improve scripts/rust_is_available.sh.

The major addition in v2 is the test suite in the last commit. I added it because I wanted to have a proper way to test any further changes to it (such as the suggested set -- idea to avoid forking by Masahiro), and so that adding new checks was easier to justify too (i.e. vs. the added complexity).

v2: Rust abstractions for Crypto API

Before sending v2 of my crypto patch [1] to linux-crypto ml and checking the chance of Rust bindings for crypto being accepted, I’d like to iron out Rust issues. I’d appreciate any feedback.

v1: KUnit integration for Rust doctests

This is the initial KUnit integration for running Rust documentation tests within the kernel.

Thank you to the KUnit team for all the input and feedback on this over the months, as well as the Intel LKP 0-Day team!

v1: rust: make UnsafeCell the outer type in Opaque

When combining UnsafeCell with MaybeUninit, it is idiomatic to use UnsafeCell as the outer type. Intuitively, this is because a MaybeUninit<T> might not contain a T, but we always want the effect of the UnsafeCell, even if the inner value is uninitialized.

Now, strictly speaking, this doesn’t really make a difference. The compiler will always apply the UnsafeCell effect even if the inner value is uninitialized. But I think we should follow the convention here.

v1: rust: allocator: Prevents mis-aligned allocation

Currently the KernelAllocator simply passes the size of the type Layout to krealloc(), and in theory the alignment requirement from the type Layout may be larger than the guarantee provided by SLAB, which means the allocated object is mis-aligned.

v1: Rust abstractions for network device drivers

This patchset adds minimum Rust abstractions for network device drivers and an example of a Rust network device driver, a simpler version of drivers/net/dummy.c.

v1: rust: bindgen: upgrade to 0.65.1

Upgrades bindgen to code-generation for anonymous unions, structs, and enums [7] for LLVM-16 based toolchains.

The upgrade also incorporates noreturn support from bindgen, allowing us to remove useless loop calls which were placed as a workaround.

BPF

v1: dwarves: dwarves: encode BTF kind layout, crcs

Encode kind layout at time of BTF encoding via --btf_gen_kind_layout and set CRC if --btf_gen_crc is set.

v2: bpf-next: bpf: support BTF kind layout info, CRCs

By separating parsing BTF from using all the information it provides, we allow BTF to encode new features even if they cannot be used. This is helpful in particular for cases where newer tools for BTF generation run on an older kernel; BTF kinds may be present that the kernel cannot yet use, but at least it can parse the BTF provided. Meanwhile userspace tools with newer libbpf may be able to use the newer information.

v2: Reduce overhead of LSMs with static calls

LSM hooks (callbacks) are currently invoked as indirect function calls. These callbacks are registered into a linked list at boot time as the order of the LSMs can be configured on the kernel command line with the “lsm=” command line parameter.

v4: bpf-next: xsk: multi-buffer support

This series of patches add multi-buffer support for AF_XDP. XDP and various NIC drivers already have support for multi-buffer packets. With this patch set, programs using AF_XDP sockets can now also receive and transmit multi-buffer packets both in copy as well as zero-copy mode. ZC multi-buffer implementation is based on ice driver.

v1: nf: netfilter: conntrack: Avoid nf_ct_helper_hash uses after free

If register_nf_conntrack_bpf() fails (for example, if the .BTF section contains an invalid entry), nf_conntrack_init_start() calls nf_conntrack_helper_fini() as part of its cleanup path and nf_ct_helper_hash gets freed.

v1: bpf: bpf/btf: Accept function names that contain dots

When building a kernel with LLVM=1, LLVM_IAS=0 and CONFIG_KASAN=y, LLVM leaves DWARF tags for the “asan.module_ctor” & co symbols. In turn, pahole creates BTF_KIND_FUNC entries for these and this makes the BTF metadata validation fail because they contain a dot.
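The failure mode can be illustrated with a toy name validator. The two regular expressions are an assumed approximation of a C-identifier rule with and without dots; the kernel's actual checks live in kernel/bpf/btf.c and may differ in detail.

```python
import re

# Assumed approximations, for illustration only.
STRICT_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
DOTTED_NAME = re.compile(r"[A-Za-z_.][A-Za-z0-9_.]*")


def name_is_valid(name: str, allow_dots: bool) -> bool:
    """Check a function name against the strict or the dot-tolerant rule."""
    pattern = DOTTED_NAME if allow_dots else STRICT_NAME
    return pattern.fullmatch(name) is not None


for name in ("tcp_sendmsg", "asan.module_ctor"):
    print(name, name_is_valid(name, allow_dots=False), name_is_valid(name, allow_dots=True))
```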

v1: bpf-next: bpf: generate ‘nomerge’ for map helpers in bpf_helper_defs.h

Update code generation for bpf_helper_defs.h by adding __attribute__((nomerge)) for a set of helper functions to prevent some verifier-unfriendly compiler optimizations.

v1: fprobe: Release rethook after the ftrace_ops is unregistered

While running bpf selftests it’s possible to get following fault:

v1: net: igc: Avoid dereference of ptr_err in igc_clean_rx_irq()

In igc_clean_rx_irq() the result of a call to igc_xdp_run_prog() is assigned to the skb local variable. This may be an ERR_PTR.

A little later the following is executed, which seems to be a possible dereference of an ERR_PTR.

total_bytes += skb->len;
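Why such a dereference is dangerous follows from how the kernel encodes errors in pointers: ERR_PTR() stores a negative errno in the pointer value itself, and IS_ERR() treats the top 4095 addresses as errors. The model below assumes a 64-bit kernel and uses -ENOMEM as a hypothetical error; it does not reproduce the igc code.

```python
import errno

MAX_ERRNO = 4095
POINTER_BITS = 64  # assumed 64-bit kernel
POINTER_MASK = (1 << POINTER_BITS) - 1


def err_ptr(error: int) -> int:
    """Encode a negative errno as a pointer-sized integer, like ERR_PTR()."""
    return error & POINTER_MASK


def is_err(ptr: int) -> bool:
    """Model of IS_ERR(): true for the top MAX_ERRNO pointer values."""
    return ptr >= (-MAX_ERRNO & POINTER_MASK)


def ptr_err(ptr: int) -> int:
    """Recover the negative errno from an error pointer, like PTR_ERR()."""
    return ptr - (1 << POINTER_BITS)


skb = err_ptr(-errno.ENOMEM)  # hypothetical failure result
if is_err(skb):
    print(f"error pointer {skb:#x}, errno {-ptr_err(skb)}: must not be dereferenced")
```

An error pointer is non-NULL, so a plain NULL check does not catch it; code has to test IS_ERR() before touching any field such as skb->len.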

v2: perf/core: Bail out early if the request AUX area is out of bound

‘rb->aux_pages’ allocated by kcalloc() is a pointer array which is used to maintain AUX trace pages. The allocated page for this array is physically contiguous (and virtually contiguous) with an order of 0..MAX_ORDER. If the size of the pointer array exceeds the limit set by MAX_ORDER, it triggers a WARNING.

v1: bpf: Force kprobe multi expected_attach_type for kprobe_multi link

We currently allow creating a perf link for a program with expected_attach_type == BPF_TRACE_KPROBE_MULTI.

This will cause crash when we call helpers like get_attach_cookie or get_func_ip in such program, because it will call the kprobe_multi’s version (current->bpf_ctx context setup) of those helpers while it expects perf_link’s current->bpf_ctx context setup.

v2: bpf-next: Add SO_REUSEPORT support for TC bpf_sk_assign

We want to replace iptables TPROXY with a BPF program at TC ingress. To make this work in all cases we need to assign a SO_REUSEPORT socket to an skb, which is currently prohibited. This series adds support for such sockets to bpf_sk_assign. See patch 5 for details.

v6: bpf-next: Add benchmark for bpf memory allocator

This patchset includes some trivial fixes for the benchmark framework and a new benchmark for the bpf memory allocator, originating from the handle-reuse patchset. Because the htab-mem benchmark depends on the fixes, I am posting these patches together.

v2: lib/test_bpf: Call page_address() on page acquired with GFP_KERNEL flag

generate_test_data() acquires a page with alloc_page(GFP_KERNEL). Pages allocated with GFP_KERNEL cannot come from Highmem. This is why there is no need to call kmap() on them.

Therefore, use a plain page_address() on that page.

v5: bpf-next: bpf, x86: allow function arguments up to 12 for TRACING

Therefore, let’s enhance it by increasing the function arguments count allowed in arch_prepare_bpf_trampoline(), for now, only x86_64.

In the 1st patch, we clean garbage value in upper bytes of the trampoline when we store the arguments from regs into stack.

In the 2nd patch, we make arch_prepare_bpf_trampoline() support copying function arguments on the stack for the x86 arch. Therefore, the maximum number of arguments can be up to MAX_BPF_FUNC_ARGS for FENTRY and FEXIT. Meanwhile, we clean the potential garbage value when we copy the on-stack arguments.

v1: bpf-next: bpf: netdev TX metadata

The goal of this series is to add two new standard-ish places in the transmit path:

  1. Right before the packet is transmitted (with access to TX descriptors)
  2. Right after the packet is actually transmitted and we’ve received the completion (again, with access to TX completion descriptors)

v5: bpf-next: verify scalar ids mapping in regsafe()

This example is unsafe because not all execution paths verify r7 range. Because of the jump at (4) the verifier would arrive at (6) in two states: I. r6{.id=b}, r7{.id=b} via path 1-6; II. r6{.id=a}, r7{.id=b} via path 1-4, 6.

Currently regsafe() does not call check_ids() for scalar registers, thus from POV of regsafe() states (I) and (II) are identical.

The change is split in two parts:

  • patches #1,2: update for mark_chain_precision() to propagate precision marks through scalar IDs.
  • patches #3,4: update for regsafe() to use a special version of check_ids() for precise scalar values.
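The role of check_ids() in the example above can be shown with a simplified model. The dictionary-based mapping and the helper scalar_ids_match() are inventions of this sketch; the kernel's check_ids() works on a fixed-size array and covers more cases.

```python
def check_ids(old_id: int, cur_id: int, idmap: dict) -> bool:
    """Return True if old_id can be consistently mapped to cur_id."""
    if old_id in idmap:
        return idmap[old_id] == cur_id
    idmap[old_id] = cur_id
    return True


def scalar_ids_match(old_state: dict, cur_state: dict) -> bool:
    """Compare scalar IDs of two register states under one shared mapping."""
    idmap = {}
    return all(
        check_ids(old_state[reg], cur_state[reg], idmap)
        for reg in sorted(old_state)
    )


A, B = 1, 2  # arbitrary values standing in for IDs "a" and "b"
state_1 = {"r6": B, "r7": B}  # reached via path 1-6
state_2 = {"r6": A, "r7": B}  # reached via path 1-4, 6

# With state_1 already verified, state_2 must not be treated as equivalent:
# r6 maps b -> a, but r7 then asks for b -> b, which is inconsistent.
print(scalar_ids_match(state_1, state_2))
```

Without the ID comparison, both states look identical to regsafe(), and the second path would be pruned even though r6 and r7 are no longer linked in it.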

v3: bpf-next: bpf: Support ->fill_link_info for kprobe_multi and perf_event links

This patchset enhances the usability of kprobe_multi programs by introducing support for ->fill_link_info. This allows users to easily determine the probed functions associated with a kprobe_multi program. While bpftool perf show already provides information about functions probed by perf_event programs, supporting ->fill_link_info ensures consistent access to this information across all bpf links.

v4: net-next: introduce page_pool_alloc() API

In [1] & [2], there are use cases for veth and virtio_net to use frag support in page pool to reduce memory usage, and they may request different frag sizes depending on the head/tail room space for xdp_frame/shinfo and the mtu/packet size. When the requested frag size is large enough that a single page cannot be split into more than one frag, using frag support only incurs a performance penalty because of the extra frag count handling.

v1: lib/test_bpf: Replace kmap() with kmap_local_page()

kmap() has been deprecated in favor of the kmap_local_page() due to high cost, restricted mapping space, the overhead of a global lock for synchronization, and making the process sleep in the absence of free slots.

v1: Add a sysctl option to disable bpf offensive helpers.

Some eBPF helper functions have long been regarded as problematic [1]. Beyond being used for powerful rootkits, these features can also be exploited to harm containers by performing various attacks on processes outside the container in the entire VM, such as process DoS, information theft, and container escape.

Related Technology News

Qemu

v1: hw/riscv/virt.c: check for ‘ssaia’ with KVM AIA

This patch was inspired by my review and testing of the QEMU KVM AIA work. It’s not dependent on it though, and can be reviewed and merged separately.

v2: target/riscv: Add support for BF16 extensions

Specification for BF16 extensions can be found in: https://github.com/riscv/riscv-bfloat16

The port is available here: https://github.com/plctlab/plct-qemu/tree/plct-bf16-upstream-v2

v1: riscv-to-apply queue

The following changes since commit fdd0df5340a8ebc8de88078387ebc85c5af7b40f:

Merge tag ‘pull-ppc-20230610’ of https://gitlab.com/danielhb/qemu into staging (2023-06-10 07:25:00 -0700)

are available in the Git repository at:

https://github.com/alistair23/qemu.git tags/pull-riscv-to-apply-20230614

for you to fetch changes up to 860029321d9ebdff47e89561de61e9441fead70a:

v2: disas/riscv: Add vendor extension support

This series adds vendor extension support to the QEMU disassembler for RISC-V. The following vendor extensions are covered:

  • XThead{Ba,Bb,Bs,Cmo,CondMov,FMemIdx,Fmv,Mac,MemIdx,MemPair,Sync}
  • XVentanaCondOps
