泰晓科技 (TinyLab) -- Focused on Linux: tracing things to their source, seeing the whole from the details!
Website: https://tinylab.org


RISC-V Linux Kernel and Related Technology News, Issue 49

Written by 呀呀呀 on 2023/06/13

Date: 20230611
Editor: 晓依
Repository: RISC-V Linux Kernel Technology Research Activity (RISC-V Linux 内核技术调研活动)
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support

v3: Add D1/T113s thermal sensor controller support

This series adds support for Allwinner D1/T113s thermal sensor controller. This controller is similar to the one on H6, but with only one sensor and different scale and offset values.

v5: Add support for Allwinner GPADC on D1/T113s/R329/T507 SoCs

This series adds support for general purpose ADC (GPADC) on new Allwinner’s SoCs, such as D1, T113s, T507 and R329. The implemented driver provides basic functionality for getting ADC channels data.

v1: dt-bindings: riscv: cpus: switch to unevaluatedProperties: false

Do the various bits needed to drop the additionalProperties: true that we currently have in riscv/cpu.yaml, to permit actually enforcing what people put in cpus nodes.

v1: riscv: move memblock_allow_resize() after lm is ready

The initial memblock metadata is accessed from kernel image mapping. The regions arrays need to be “reallocated” from memblock and accessed through linear mapping to cover more memblock regions. So the resizing should not be allowed until linear mapping is ready. Note that there are memblock allocations when building linear mapping.

v3: RISCV: Add KVM_GET_REG_LIST API

KVM_GET_REG_LIST will dump all register IDs that are available to KVM_GET/SET_ONE_REG, and it is very useful for identifying platform regression issues during VM migration.

v2: arch: allow pte_offset_map[_lock]() to fail

Here is v2 series of patches to various architectures, based on v6.4-rc5: preparing for v2 of changes following in mm, affecting pte_offset_map() and pte_offset_map_lock(). There are very few differences from v1: noted patch by patch below.

v2: dt-bindings: riscv: deprecate riscv,isa

When the RISC-V dt-bindings were accepted upstream in Linux, the base ISA etc had yet to be ratified. By the ratification of the base ISA, incompatible changes had snuck into the specifications - for example the Zicsr and Zifencei extensions were spun out of the base ISA.
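To make the problem with a single ISA string more concrete, here is a minimal Python sketch that splits such a string into its base width, single-letter extensions, and multi-letter extensions. The helper name `parse_isa_string` and the example strings are illustrative assumptions; this is not the binding's schema or any kernel parser.

```python
def parse_isa_string(isa: str):
    """Split an ISA string such as 'rv64imafdc_zicsr_zifencei'.

    Hypothetical helper for illustration only: it handles the common
    lowercase form and ignores version numbers.
    """
    isa = isa.lower()
    if not isa.startswith(("rv32", "rv64")):
        raise ValueError(f"unsupported ISA string: {isa!r}")
    xlen = int(isa[2:4])
    # Single-letter extensions come first; multi-letter ones follow,
    # separated by underscores.
    head, *multi = isa[4:].split("_")
    return xlen, list(head), multi


print(parse_isa_string("rv64imafdc_zicsr_zifencei"))
print(parse_isa_string("rv64imac"))
```

A string such as `rv64imac` parses identically whether or not its author meant the base ISA to include what later became Zicsr and Zifencei, which is the kind of ambiguity the excerpt describes.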

Patch “riscv: vmlinux.lds.S: Explicitly handle ‘.got’ section” has been added to the 6.3-stable tree

This is a note to let you know that I’ve just added the patch titled

riscv: vmlinux.lds.S: Explicitly handle '.got' section

to the 6.3-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is: riscv-vmlinux.lds.s-explicitly-handle-.got-section.patch and it can be found in the queue-6.3 subdirectory.

v1: riscv: reserve DTB before possible memblock allocation

It’s possible that early_init_fdt_scan_reserved_mem() allocates memory from memblock for dynamic reserved memory in /reserved-memory node. Any fixed reservation must be done before that to avoid potential conflicts.

v3: tools/nolibc: add a new syscall helper

This is the revision of the v2 syscall helpers [1], it is based on -ENOSYS patchset [3], so, it is ok to simply merge both of them.

This revision mainly applied Thomas’ method, removed the __syscall() helper and replaced it with __sysret() instead, because __syscall() looks like _syscall() and syscall(), and may mislead developers.

v4: nolibc: add part2 of support for rv32

This is the v4 part2 of support for rv32 (v3 [1]); it applies the suggestions from Thomas, Arnd [2] and you [3]. Now, the rv32 compile support is almost aligned with x86 except for the extra KARCH to make the kernel happy. Thanks very much for your nice review!

v3: riscv: Introduce KASLR

The new virtual kernel location is limited by the early page table that only has one PUD and with the PMD alignment constraint, the kernel can only take < 512 positions.
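The "< 512 positions" figure can be checked with a little arithmetic. The sketch below assumes 4 KiB pages, so that one PUD entry covers 1 GiB and a PMD mapping is 2 MiB; these sizes and the helper name `kaslr_slots` are assumptions for illustration, and the patch itself may compute the offset differently.

```python
PMD_SIZE = 2 * 1024 * 1024        # assumed: 2 MiB PMD mapping (4 KiB pages)
PUD_SIZE = 1024 * 1024 * 1024     # assumed: 1 GiB covered by one PUD entry


def kaslr_slots(kernel_size: int) -> int:
    """Number of PMD-aligned offsets at which a kernel of kernel_size
    bytes still fits inside a single PUD-sized window."""
    # Round the image size up to a whole number of PMD-sized chunks.
    kernel_pmds = -(-kernel_size // PMD_SIZE)
    total_pmds = PUD_SIZE // PMD_SIZE
    return max(total_pmds - kernel_pmds + 1, 0)


print(PUD_SIZE // PMD_SIZE)              # 512 PMD-aligned offsets in total
print(kaslr_slots(20 * 1024 * 1024))     # 503 usable offsets for a 20 MiB image
```

Because the image itself occupies several PMD-sized chunks, the number of usable offsets ends up below 512, which matches the limit quoted above.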

v4: Add JH7110 cpufreq support

This patchset adds the compatible strings into the allowlist for supporting the generic cpufreq driver on JH7110 SoC. Also, it enables the axp15060 pmic for the cpu power source.

v2: tools/nolibc: add two new syscall helpers

This is the revision of the v1 syscall helpers [1], just rebased it on patchset [3], so, it is ok to simply merge both of them.

This revision mainly applied your suggestions of v1, both of the syscall return and call helpers are simplified or cleaned up.

v2: Documentation: RISC-V: patch-acceptance: mention patchwork’s role

Palmer suggested at some point, not sure if it was in one of the weekly linux-riscv syncs, or a conversation at FOSDEM, that we should document the role of the automation running on our patchwork instance plays in patch acceptance.

v3: gpio: sifive: Add missing check for platform_get_irq

Add the missing check for platform_get_irq() and return error code if it fails. The returned error code will be dealt with in builtin_platform_driver(sifive_gpio_driver) and the driver will not be registered.

v2: perf parse-regs: Refactor architecture functions

This patch series is to refactor arch related functions for register parsing, which follows up the discussion for v1: https://lore.kernel.org/lkml/20230520025537.1811986-1-leo.yan@linaro.org/

v1: 6.3: riscv: vmlinux.lds.S: Explicitly handle ‘.got’ section

This is not an issue in mainline because handling of the .got section was added by commit 39b33072941f (“riscv: Introduce CONFIG_RELOCATABLE”) and further extended by commit 26e7aacb83df (“riscv: Allow to downgrade paging mode from the command line”) in 6.4-rc1. Neither of these changes are suitable for stable, so add explicit handling of the .got section in a standalone change to align 6.3 and mainline, which addresses the warning.

v21: -next: riscv: Add vector ISA support

This is the v21 patch series for adding Vector extension support in Linux. Please refer to [1] for the introduction of the patchset. The v21 patch series was aimed to solve build issues from v19, provide usage guideline for the prctl interface, and address review comments on v20.

v3: gpio: ath79: Add missing check for platform_get_irq

Add the missing check for platform_get_irq() and return error if it fails.

Process Scheduling

v1: sched/deadline: merge __dequeue_dl_entity() into its sole caller

Sole caller dequeue_dl_entity() calls __dequeue_dl_entity() directly. So __dequeue_dl_entity() can be merged into its sole caller. No functional change intended.

v1: net/sched: act_pedit: Use kmemdup() to replace kmalloc + memcpy

./net/sched/act_pedit.c:245:21-28: WARNING opportunity for kmemdup.

Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=5478

v1: sched/wait: Determine whether the wait queue is empty before waking up

When we did some benchmark tests (such as pipe tests), we found that the wake behavior was still triggered when the wait queue was empty, even though it would exit later.
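The idea of skipping the wake-up work when nobody is waiting can be sketched in user space as follows. This is a toy Python model, not the kernel's wait queue code: the class `WaitQueue`, its methods, and the `lock_acquisitions` counter are all hypothetical names used only to show the fast path.

```python
import threading


class WaitQueue:
    """Toy wait queue; `lock_acquisitions` counts how often wake_up()
    had to take the lock (hypothetical instrumentation)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._waiters = []
        self.lock_acquisitions = 0

    def add_waiter(self, callback):
        with self._lock:
            self._waiters.append(callback)

    def has_waiters(self) -> bool:
        # Cheap peek without taking the lock.
        return bool(self._waiters)

    def wake_up(self) -> int:
        # Fast path: nothing to wake, so skip the lock entirely.
        if not self.has_waiters():
            return 0
        with self._lock:
            self.lock_acquisitions += 1
            waiters, self._waiters = self._waiters, []
        for callback in waiters:
            callback()
        return len(waiters)


wq = WaitQueue()
print(wq.wake_up(), wq.lock_acquisitions)   # empty queue: no lock taken
```

In real kernel code an unlocked emptiness check also needs careful memory ordering between the waiter and the waker, which this sketch deliberately ignores.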

v3: net/sched: Set the flushing flags to false to prevent an infinite loop and add one test to tdc

[root@localhost tc-testing]# ./tdc.py -f tc-tests/infra/filter.json
Test c2b4: Adding a new filter after flushing empty chain doesn’t cause an infinite loop
All test results:
1..1
ok 1 c2b4 - Adding a new filter after flushing empty chain doesn’t cause an infinite loop

v2: sched/nohz: Add HRTICK_BW for using cfs bandwidth with nohz_full

CFS bandwidth limits and NOHZ full don’t play well together. Tasks can easily run well past their quotas before a remote tick does accounting. This leads to long, multi-period stalls before such tasks can run again. Use the hrtick mechanism to set a sched tick to fire at remaining_runtime in the future if we are on a nohz full cpu, if the task has quota and if we are likely to disable the tick (nr_running == 1). This allows for bandwidth accounting before tasks go too far over quota.
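The three conditions in the description can be read as a small decision function. The sketch below is only a model of that decision; the function name `hrtick_bw_delay_ns` and its parameters are hypothetical and do not correspond to identifiers in the patch.

```python
from typing import Optional


def hrtick_bw_delay_ns(is_nohz_full_cpu: bool,
                       has_quota: bool,
                       nr_running: int,
                       remaining_runtime_ns: int) -> Optional[int]:
    """Return the delay after which a tick should fire, or None when
    no timer is needed. All parameter names are illustrative."""
    if not is_nohz_full_cpu:
        return None          # the regular tick already does the accounting
    if not has_quota:
        return None          # no bandwidth limit to enforce
    if nr_running != 1:
        return None          # the tick is unlikely to be stopped
    # Fire when the remaining runtime would be used up.
    return max(remaining_runtime_ns, 0)


print(hrtick_bw_delay_ns(True, True, 1, 5_000_000))
print(hrtick_bw_delay_ns(True, True, 2, 5_000_000))
```

Only a lone task with a quota on a nohz_full CPU gets a timer, set to fire once its remaining runtime would be used up.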

v1: sched/idle: disable tick in idle=poll idle entry

Commit a5183862e76fdc25f36b39c2489b816a5c66e2e5 (“tick/nohz: Conditionally restart tick on idle exit”) allows a nohz_full CPU to enter idle and return from it with the scheduler tick disabled (since the tick might be undesired noise).

The idle=poll case still unconditionally restarts the tick when entering idle.

To reduce the noise for that case as well, stop the tick when entering idle, for the idle=poll case.

v1: sched/debug,sched/core: Reset hung task detector while processing sysrq-t

On devices with multiple CPUs and multiple processes, outputting lengthy sysrq-t content on a slow serial port can consume a significant amount of time. We need to reset the hung task detector to avoid false hung task alerts.

v1: net/sched: Set the flushing flags to false to prevent an infinite loop

On 06/06/2023 11:45, renmingshuai wrote:

When a new chain is added by using tc, one soft lockup alarm will be generated after deleting the prio 0 filter of the chain. To reproduce the problem, perform the following steps:
(1) tc qdisc add dev eth0 root handle 1: htb default 1
(2) tc chain add dev eth0
(3) tc filter del dev eth0 chain 0 parent 1: prio 0
(4) tc filter add dev eth0 chain 0 parent 1:

v1: sched: use kmem_cache_zalloc() to zero allocated tg

It’s more convenient to use kmem_cache_zalloc() to allocate zeroed tg. No functional change intended.

Memory Management

v2: lib: Replace kmap() with kmap_local_page()

kmap() has been deprecated in favor of the kmap_local_page() due to high cost, restricted mapping space, the overhead of a global lock for synchronization, and making the process sleep in the absence of free slots.

v1: mm: compaction: mark kcompactd_run() and kcompactd_stop() __meminit

Add __meminit to kcompactd_run() and kcompactd_stop() to ensure they default to __init when memory hotplug is not enabled.

v1: mm: hugetlb: Add Kconfig option to set default nr_overcommit_hugepages

The default kernel configuration does not allow any huge page allocation until after setting nr_hugepages or nr_overcommit_hugepages to a non-zero value; without setting those, mmap attempts with MAP_HUGETLB will always fail with -ENOMEM. nr_overcommit_hugepages allows userspace to attempt to allocate huge pages at runtime, succeeding if the kernel can find or assemble a free huge page.

v1: mm/khugepaged: use DEFINE_READ_MOSTLY_HASHTABLE macro

These are equivalent, but DEFINE_READ_MOSTLY_HASHTABLE exists to define a hashtable in the .data..read_mostly section.

v3: mm/folio: Avoid special handling for order value 0 in folio_set_order

folio_set_order(folio, 0) is used in the kernel in two places: __destroy_compound_gigantic_folio and __prep_compound_gigantic_folio. Currently, it is called to clear out folio->_folio_nr_pages and folio->_folio_order.

v2: Optimize the fast path of mas_store()

Add fast paths for mas_wr_append() and mas_wr_slot_store() respectively. The newly added fast path of mas_wr_append() is used in fork() and how much it benefits fork() depends on how many VMAs are duplicated.

v1: net-next: splice, net: Some miscellaneous MSG_SPLICE_PAGES changes

Now that the splice_to_socket() has been rewritten so that nothing now uses the ->sendpage() file op[1], some further changes can be made, so here are some miscellaneous changes that can now be done.

v1: mm: compaction: skip memory hole rapidly when isolating migratable pages

On some machines, the normal zone can have a large memory hole like the memory layout below, and we can see that the range from 0x100000000 to 0x1800000000 is a hole. When isolating migratable pages, the scanner can meet the hole and it will take more time to skip the large hole. From my measurement, I can see the isolation scanner will take 80us ~ 100us to skip the large hole [0x100000000 - 0x1800000000].

v2: mm/vmalloc: Replace the ternary conditional operator with min()

It would be better to replace the traditional ternary conditional operator with min() in zero_iter

v1: net-next: sock: Propose socket.urgent for sockmem isolation

This is just a PoC patch intended to resume the discussion about tcpmem isolation opened by Google in LPC’22 [1].

We are facing the same problem that the global shared threshold can cause isolation issues. Low priority jobs can hog TCP memory and adversely impact higher priority jobs. What’s worse is that these low priority jobs usually have smaller cpu weights leading to poor ability to consume rx data.

v1: revert shrinker_srcu related changes

Kernel test robot reports -88.8% regression in stress-ng.ramfs.ops_per_sec test case [1], which is caused by commit f95bdb700bc6 (“mm: vmscan: make global slab shrink lockless”). The root cause is that SRCU has to be careful to not frequently check for SRCU read-side critical section exits. Therefore, even if no one is currently in the SRCU read-side critical section, synchronize_srcu() cannot return quickly. That’s why unregister_shrinker() has become slower.

v6: mm: ioremap: Convert architectures to take GENERIC_IOREMAP way

Currently, many architectures haven’t taken the standard GENERIC_IOREMAP way to implement ioremap_prot(), iounmap(), and ioremap_xx(), but implement these functions specifically under each arch’s folder. This causes a lot of duplicated code for ioremap() and iounmap().

v1: watchdog/mm: Allow dumping memory info in pretimeout

On my (embedded) systems, the most common cause of hitting the watchdog (pre)timeout is due to thrashing. Diagnosing these problems is hard without knowing the memory state at the point of the watchdog hit. In order to make this information available, add a module parameter to the watchdog pretimeout panic governor to ask it to dump memory info and the OOM task list (using a new helper in the OOM code) before triggering the panic.

v1: mm/min_free_kbytes: modify min_free_kbytes calculation rules

The current calculation of min_free_kbytes only uses ZONE_DMA and ZONE_NORMAL pages, but the ZONE_MOVABLE zone->_watermark[WMARK_MIN] will also take a share of min_free_kbytes. This will cause the min watermark of ZONE_NORMAL to be too small in the presence of ZONE_MOVABLE.
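A rough worked example shows how the effect arises. The model below assumes 4 KiB pages, a sizing rule of sqrt(lowmem_kbytes * 16) clamped to [128, 262144], and a split of the result across zones in proportion to their managed pages. These rules are my simplified reading of the current behaviour, and the helper names and zone sizes are hypothetical.

```python
import math

PAGE_KB = 4  # assumed 4 KiB pages


def min_free_kbytes(lowmem_pages: int) -> int:
    """Simplified sizing rule: sqrt(lowmem_kbytes * 16),
    clamped to [128, 262144]."""
    lowmem_kbytes = lowmem_pages * PAGE_KB
    return min(max(math.isqrt(lowmem_kbytes * 16), 128), 262144)


def wmark_min_pages(zones: dict, sized_from: list) -> dict:
    """Size min_free_kbytes from the zones in `sized_from`, then split it
    across all zones in proportion to their managed pages."""
    pages_min = min_free_kbytes(sum(zones[z] for z in sized_from)) // PAGE_KB
    total = sum(zones.values())
    return {name: pages_min * pages // total for name, pages in zones.items()}


# 16 GiB machine, all of it in the normal zone.
print(wmark_min_pages({"Normal": 4194304}, ["Normal"]))
# Same machine with 12 GiB moved into the movable zone.
print(wmark_min_pages({"Normal": 1048576, "Movable": 3145728}, ["Normal"]))
```

In this model the normal zone's min watermark drops from 4096 pages to 512 pages once most memory is placed in the movable zone: the total is sized from the smaller normal zone only, yet it is still shared with the movable zone.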

v1: mm: kill [add|del]_page_to_lru_list() (http://lore.kernel.org/linux-mm/20230609013901.79250-1-wangkefeng.wang@huawei.com/)

Directly call lruvec_del_folio(), and drop unused page interfaces.

v2: mm: allow pte_offset_map[_lock]() to fail

Here is v2 series of patches to mm, based on v6.4-rc5: preparing for v2 effective changes to follow, probably next week (when I hope s390 will be sorted), affecting pte_offset_map() and pte_offset_map_lock(). There are very few differences from v1: noted patch by patch below.

v1: udmabuf: revert ‘Add support for mapping hugepages (v4)’

This effectively reverts commit 16c243e99d33 (“udmabuf: Add support for mapping hugepages (v4)”). Recently, Junxiao Chang found a BUG with page map counting as described here [1]. This issue pointed out that the udmabuf driver was making direct use of subpages of hugetlb pages. This is not a good idea, and no other mm code attempts such use. In addition to the mapcount issue, this also causes issues with hugetlb vmemmap optimization and page poisoning.

v1: mm: Sync percpu mm RSS counters before querying

An issue was observed with stats collected in struct rusage on ppc64le with 64kB pages. The percpu counters use batching with

percpu_counter_batch = max(32, nr*2) # in PAGE_SIZE

i.e. with larger pages but similar RSS consumption (bytes), there’ll be fewer flushes and the error is more noticeable.
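The effect of page size on the possible error can be estimated with the batch formula quoted above. The sketch assumes that `nr` is the number of online CPUs and that, in the worst case, every CPU holds just under one full batch locally; the function names are hypothetical.

```python
def percpu_counter_batch(nr_cpus: int) -> int:
    # Formula quoted in the cover letter: max(32, nr*2), counted in pages.
    return max(32, nr_cpus * 2)


def worst_case_error_bytes(nr_cpus: int, page_size: int) -> int:
    """Upper bound on unflushed RSS, assuming every CPU holds up to one
    full batch of pages locally (illustrative model)."""
    return nr_cpus * percpu_counter_batch(nr_cpus) * page_size


for page_size in (4 * 1024, 64 * 1024):
    err = worst_case_error_bytes(8, page_size)
    print(f"{page_size // 1024} KiB pages: up to {err // (1024 * 1024)} MiB unflushed")
```

With 8 CPUs the bound grows from 1 MiB with 4 KiB pages to 16 MiB with 64 KiB pages, which is why the discrepancy is easier to notice on the ppc64le configuration.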

v1: staging: lib: Use memcpy_to/from_page()

Deprecate kmap() in favor of kmap_local_page() due to high cost, restricted mapping space, the overhead of a global lock for synchronization, and making the process sleep in the absence of free slots.

v3: Documentation/mm: Initial page table documentation

This is based on an earlier blog post at people.kernel.org, it describes the concepts about page tables that were hardest for me to grasp when dealing with them for the first time, such as the prevalent three-letter acronyms pfn, pgd, p4d, pud, pmd and pte.

v4: mm/migrate_device: Try to handle swapcache pages

Migrating file pages and swapcache pages into device memory is not supported. Try to get rid of the swap cache, and if successful, go ahead as with other anonymous pages.

v1: binfmt_elf: dynamically allocate note.data in parse_elf_properties

Dynamically allocate note.data in parse_elf_properties to fix compilation warning on some arch.

v1: mm/mm_init.c: add debug messsge for dma zone

If freesize is less than dma_reserve, print warning message to report this case.

v4: drm-next: v1: DRM GPUVA Manager & Nouveau VM_BIND UAPI

Furthermore, with the DRM GPUVA manager it provides a new DRM core feature to keep track of GPU virtual address (VA) mappings in a more generic way.

The DRM GPUVA manager is intended to help drivers implement userspace-manageable GPU VA spaces in reference to the Vulkan API. In order to achieve this goal it serves the following purposes in this context.

File Systems

v1: fs/aio: Stop allocating aio rings from HIGHMEM

There is no need to allocate aio rings from HIGHMEM because of very little memory needed here.

Therefore, use GFP_USER flag in find_or_create_page() and get rid of kmap*() mappings.

v4: blksnap - block devices snapshots module

I am happy to offer an improved version of the Block Devices Snapshots Module. It allows creating non-persistent snapshots of any block devices. The main purpose of such snapshots is to provide backups of block devices. See more in Documentation/block/blksnap.rst.

v1: Reduce impact of overlayfs fake path files

This is the solution that we discussed for removing FMODE_NONOTIFY from overlayfs real files.

My branch [1] has an extra patch to remove FMODE_NONOTIFY, but I am still testing the ovl-fsnotify interaction, so we can defer that step to later.

v1: ovl: port to new mount api

We recently ported util-linux to the new mount api. Now the mount(8) tool will by default use the new mount api. While trying hard to fall back to the old mount api gracefully there are still cases where we run into issues that are difficult to handle nicely.

v1: bdev: allow buffer-head & iomap aops to co-exist

At LSFMM it was clear that for some in order to support large order folios we want to use iomap. So the filesystems staying and requiring buffer-heads cannot make use of high order folios. This simplifies support and reduces the scope for what we need to do in order to support high order folios for buffered-io.

v2: fs: avoid empty option when generating legacy mount string

As each option string fragment is always prepended with a comma it would happen that the whole string always starts with a comma. This could be interpreted by filesystem drivers as an empty option and may produce errors.
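The leading-comma problem is easy to reproduce in a few lines. The sketch below is a Python model of the string building, not the kernel code; both function names are hypothetical.

```python
def legacy_mount_string_buggy(options):
    # Every fragment is prefixed with a comma, so the result starts with ','.
    return "".join("," + opt for opt in options)


def legacy_mount_string_fixed(options):
    # Only insert the comma between fragments.
    return ",".join(options)


opts = ["ro", "noatime"]
print(legacy_mount_string_buggy(opts).split(","))   # first element is an empty option
print(legacy_mount_string_fixed(opts).split(","))
```

A filesystem driver that splits the buggy string on commas sees an empty first option, which is the case the patch avoids.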

v2: gfs2/buffer folio changes for 6.5

This kind of started off as a gfs2 patch series, then became entwined with buffer heads once I realised that gfs2 was the only remaining caller of __block_write_full_page(). For those not in the gfs2 world, the big point of this series is that block_write_full_page() should now handle large folios correctly.

Networking Devices

v4: net-next: tcp: enforce receive buffer memory limits by allowing the tcp window to shrink

Under certain circumstances, the tcp receive buffer memory limit set by autotuning (sk_rcvbuf) is increased due to incoming data packets as a result of the window not closing when it should be. This can result in the receive buffer growing all the way up to tcp_rmem[2], even for tcp sessions with a low BDP.

v1: amd-xgbe: extend 10Mbps support to MAC version 21H

MAC version 21H supports the 10Mbps speed. So, extend support to platforms that support it.

v1: dt-bindings: net: mediatek,net: add missing mediatek,mt7621-eth

Document the Ethernet controller found in the MediaTek MT7621 MIPS SoC family which is supported by the mtk_eth_soc driver.

v5: net-next: net: phy: add driver for MediaTek SoC built-in GE PHYs

Some of MediaTek’s Filogic SoCs come with built-in gigabit Ethernet PHYs which require calibration data from the SoC’s efuse. Despite the similar design the driver doesn’t share any code with the existing mediatek-ge.c. Add support for such PHYs by introducing a new driver with basic support for MediaTek SoCs MT7981 and MT7988 built-in 1GE PHYs.

v1: Add a sysctl option to disable bpf offensive helpers.

Some eBPF helper functions have long been regarded as problematic[1]. Beyond being used for powerful rootkits, these features can also be exploited to harm containers by performing various attacks on processes outside the container in the entire VM, such as process DoS, information theft, and container escape.

v4: net-next: virtio/vsock: support datagrams

This series introduces support for datagrams to virtio/vsock.

It is a spin-off (and smaller version) of this series from the summer: https://lore.kernel.org/all/cover.1660362668.git.bobby.eshleman@bytedance.com/

Please note that this is an RFC and should not be merged until associated changes are made to the virtio specification, which will follow after discussion from this series.

v1: net-next: net: support extack in dump and simplify ethtool uAPI

Ethtool currently requires header nest to be always present even if it doesn’t have to carry any attr for a given request. This inflicts unnecessary pain on the users.

v1: net-next: tools: ynl: generate code for the ethtool family

And finally ethtool support. Thanks to Stan’s work the ethtool family spec is quite complete, so there is a lot of operations to support.

I chickened out of stats-get support, they require at the very least type-value support on a u64 scalar. Type-value is an arrangement where a u16 attribute is encoded directly in attribute type. Code gen can support this if the inside is a nest, we just throw in an extra field into that nest to carry the attr type. But a little more coding is needed for a scalar, because first we need to turn the scalar into a struct with one member, then we can add the attr type.

v4: iwl-next: Implement support for SRIOV + LAG

The first interface added into the aggregate will be flagged as the primary interface, and this primary interface will be responsible for managing the VF’s resources. VFs created on the primary are the only VFs that will be supported on the aggregate. Only Active-Backup mode will be supported and only aggregates whose primary interface is in switchdev mode will be supported.

v2: ipvs: align inner_mac_header for encapsulation

When using encapsulation the original packet’s headers are copied to the inner headers. This preserves the space for an inner mac header, which is not used by the inner payloads for the encapsulation types supported by IPVS. If a packet is using GUE or GRE encapsulation and needs to be segmented, flow can be passed to __skb_udp_tunnel_segment() which calculates a negative tunnel header length. A negative tunnel header length causes pskb_may_pull() to fail, dropping the packet.

v1: net-next: tcp: tx path fully headless

This series completes the transition of the TCP stack tx path to headless packets: all payload now resides in page frags, never in skb->head.

v1: net-next: net: create device lookup API with reference tracking

We still see dev_hold() / dev_put() calls without reference tracker getting added in the new code. dev_get_by_name() / dev_get_by_index() seem to be one of the sources of those. Provide appropriate helpers. Allocating the tracker can obviously be done with an additional call to netdev_tracker_alloc(), but a single API feels cleaner.

v1: net-next: mdio: mdio-mux-mmioreg: Use of_property_read_reg() to parse “reg”

Use the recently added of_property_read_reg() helper to get the untranslated “reg” address value.

v1: net-next: net: add check for current MAC address in dev_set_mac_address

In some cases it is possible for kernel to come with request to change primary MAC address to the address that is already set on the given interface.

v2: net: Check if FIPS mode is enabled when running selftests

Some test cases from net/tls, net/fcnal-test and net/vrf-xfrm-tests that rely on cryptographic functions to work and use non-compliant FIPS algorithms fail in FIPS mode.

v1: net-next: tcp: Make pingpong threshold tunable

TCP pingpong threshold is 1 by default. But some applications, like SQL DB may prefer a higher pingpong threshold to activate delayed acks in quick ack mode for better performance.

v7: net-next: net: ioctl: Use kernel memory on protocol ioctl callbacks

Most of the ioctls to net protocols operate directly on the userspace argument (arg), usually doing get_user()/put_user() directly in the ioctl callback. This is not flexible, because it is hard to reuse these functions without passing userspace buffers.

v1: net-next: rhashtable: length helper for rhashtable and rhltable

Whenever someone wants to retrieve the total number of elements in a rhashtable/rhltable, they need to open-code the access to ‘nelems’. Therefore, provide a helper for this operation and convert two accesses as an example.

v1: net-next: add egress rate limit offload for Marvell 6393X family

This series aims to give access to egress rate shaping offloading available on Marvell 88E6393X family (88E6393X/88E6193X/88E6191X/88E6361)

The switch offers a very basic egress rate limiter: rate can be configured from 64kbps up to 10gbps depending on the model, with some specific increments depending on the targeted rate, and is “burstless”.

v1: net-next: net: openvswitch: add support for l4 symmetric hashing

Since its introduction, the ovs module execute_hash action allowed hash algorithms other than the skb->l4_hash to be used. However, additional hash algorithms were not implemented. This means flows requiring different hash distributions weren’t able to use the kernel datapath.

v1: net-next: bnx2x: Make dmae_reg_go_c static

Make dmae_reg_go_c static, it is only used in bnx2x_main.c

Flagged by Sparse as:

…/bnx2x_main.c:291:11: warning: symbol ‘dmae_reg_go_c’ was not declared. Should it be static?

v2: net-next: net: mana: Add support for vlan tagging

To support vlan, use MANA_LONG_PKT_FMT if vlan tag is present in TX skb. Then extract the vlan tag from the skb struct, and save it to tx_oob for the NIC to transmit. For vlan tags on the payload, they are accepted by the NIC too.

[net PATCH v2] octeontx2-af: Move validation of ptp pointer before its usage

Moved PTP pointer validation before its use to avoid smatch warning. Also used kzalloc/kfree instead of devm_kzalloc/devm_kfree.

v1: net-next: phylink EEE support

There has been some recent discussion on generalising EEE support so that drivers implement it more consistently. This has mostly focused around phylib, but there are other situations where EEE may be useful.

v1: net-next: sfc: Add devlink dev info support for EF10

Reuse the work done for EF100 to add devlink support for EF10. There is no devlink port support for EF10.

Security Hardening

v1: kunit: Add test attributes API

This is an RFC patch series to propose the addition of a test attributes framework to KUnit.

There has been interest in filtering out “slow” KUnit tests. Most notably, a new config, CONFIG_MEMCPY_SLOW_KUNIT_TEST, has been added to exclude particularly slow memcpy tests (https://lore.kernel.org/all/20230118200653.give.574-kees@kernel.org/).

v1: Integer overflows while scanning for integers

Lately I wondered whether users of integer scanning functions check for overflows. To detect such overflows around scanf I came up with the following patch. It simply triggers a WARN_ON_ONCE() upon an overflow.
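To show what kind of overflow such a check would catch, here is a minimal Python model of scanning decimal digits into a fixed-width unsigned integer. It is not the kernel's scanning code; the function name `scan_uint` and the default width of 32 bits are assumptions made for the example.

```python
def scan_uint(text: str, bits: int = 32):
    """Parse leading decimal digits into an unsigned `bits`-wide value.

    Returns (value, overflowed). On overflow the value wraps, which is
    what a fixed-width accumulator would silently do.
    """
    limit = 1 << bits
    value = 0
    overflowed = False
    for ch in text:
        if not ("0" <= ch <= "9"):
            break                      # stop at the first non-digit
        value = value * 10 + int(ch)
        if value >= limit:
            overflowed = True          # a checker would warn at this point
            value %= limit
    return value, overflowed


print(scan_uint("4294967295"))   # largest value that still fits in 32 bits
print(scan_uint("4294967296"))   # one more: wraps to 0 and reports overflow
```

Without the `overflowed` flag, the caller would receive the wrapped value and have no way to tell that the input was out of range.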

[RESEND]v1: next: Replace one-element array with DECLARE_FLEX_ARRAY() helper

One-element arrays as fake flex arrays are deprecated and we are moving towards adopting C99 flexible-array members, instead. So, replace one-element array declaration in struct ct_sns_gpnft_rsp, which is ultimately being used inside a union:

drivers/scsi/qla2xxx/qla_def.h:

Refactor the rest of the code, accordingly.

This issue was found with the help of Coccinelle.

v1: um: Use HOST_DIR for mrproper

When HEADER_ARCH was introduced, the MRPROPER_FILES (then MRPROPER_DIRS) list wasn’t adjusted, leaving SUBARCH as part of the path argument. This resulted in the “mrproper” target not cleaning up arch/x86/… when SUBARCH was specified. Since HOST_DIR is arch/$(HEADER_ARCH), use it instead to get the correct path.

v2: uml: Replace strlcpy with strscpy

strlcpy() reads the entire source buffer first. This read may exceed the destination size limit. This is both inefficient and can lead to linear read overflows if a source string is not NUL-terminated [1]. In an effort to remove strlcpy() completely [2], replace strlcpy() here with strscpy(). No return values were used, so direct replacement is safe.

[1] https://www.kernel.org/doc/html/latest/process/deprecated.html#strlcpy

[2] https://github.com/KSPP/linux/issues/89

Closes: https://lore.kernel.org/oe-kbuild-all/202305311135.zGMT1gYR-lkp@intel.com/
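The difference between the two copy routines described in the strlcpy() entry above can be modelled by counting how many source bytes each one has to read. The following Python sketch only imitates the documented behaviour (strlcpy() returns the source length, strscpy() returns the copied length or -E2BIG on truncation); the `*_model` functions are hypothetical and operate on Python bytes rather than C buffers.

```python
import errno


def strlcpy_model(dst_size: int, src: bytes):
    """Return (copied, ret, bytes_read). The return value is the full
    source length, so the whole source is scanned up to its NUL."""
    src_len = src.index(b"\0")            # raises if src is unterminated
    copied = src[:src_len][:max(dst_size - 1, 0)]
    return copied, src_len, src_len + 1


def strscpy_model(dst_size: int, src: bytes):
    """Return (copied, ret, bytes_read). Reading stops once the
    destination is full; truncation is reported as -E2BIG."""
    window = src[:dst_size]               # never look past dst_size bytes
    nul = window.find(b"\0")
    if nul >= 0:
        return window[:nul], nul, nul + 1
    return window[:max(dst_size - 1, 0)], -errno.E2BIG, len(window)


long_src = b"A" * 1000 + b"\0"
print(strlcpy_model(8, long_src)[1:])   # return value and bytes read
print(strscpy_model(8, long_src)[1:])
```

For an 8-byte destination and a 1000-character source, the first model reads 1001 bytes while the second reads only 8, and an unterminated source makes the first model fail outright, mirroring the over-read risk described in the excerpt.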

Asynchronous I/O

v1: Add io_uring support for futex wait/wake

Sending this just to the io_uring list for now so we can iron out details, questions, concerns, etc before going a bit broader to get the futex parts reviewed. Those are pretty straight forward though, and try not to get too entangled into futex internals.

v15: io_uring: add napi busy polling support

This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id’s that are currently enabled for busy polling. This list is used to determine which napi id’s enabled busy polling. For faster access it also adds a hash table.

v14: io_uring: add napi busy polling support

This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id’s that are currently enabled for busy polling. This list is used to determine which napi id’s enabled busy polling. For faster access it also adds a hash table.

Rust For Linux

v3: Rust scatterlist abstractions

This is a version of scatterlist abstractions for Rust drivers.

Scatterlist is used for efficient management of memory buffers, which is essential for many kernel-level operations such as Direct Memory Access (DMA) transfers and crypto APIs.

v2: add abstractions for network device drivers

This patchset adds minimum abstractions for network device drivers and Rust dummy network device driver, a simpler version of drivers/net/dummy.c.

v1: Rust PuzzleFS filesystem driver

This is a proof of concept driver written for the PuzzleFS next-generation container filesystem [1]. I’ve included a short abstract about puzzlefs further below. This driver is based on the rust-next branch, on top of which I’ve backported the filesystem abstractions from Wedson Almeida Filho [2][3] and Miguel Ojeda’s third-party crates support: proc-macro2, quote, syn, serde and serde_derive [4]. I’ve added the additional third-party crates serde_cbor[5] and hex [6]. Then I’ve adapted the user space puzzlefs code [1] so that the puzzlefs kernel module could present the directory hierarchy and implement the basic read functionality.

v2: Rust enablement for AArch64

The first patch enables the basic building of Rust for AArch64. Since v1 this has been rewritten to avoid the use of a target.json file for AArch64 and use the upstream rustc target definition. x86-64 still uses the target.json approach though.

BPF

v12: evm: Do HMAC of multiple per LSM xattrs for new inodes

One of the major goals of LSM stacking is to run multiple LSMs side by side without interfering with each other. The ultimate decision will depend on individual LSM decision.

v1: tools api fs: More thread safety for global filesystem variables

Multiple threads, such as with “perf top”, may race to initialize a file system path like hugetlbfs. The racy initialization of the path leads to at least memory leaks. To avoid this initialize each fs for reading the mount point path with pthread_once.
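The once-only initialization pattern can be sketched in Python, which has no direct pthread_once equivalent, with a lock-protected flag. The `Once` class, the mount point value, and the function names below are hypothetical stand-ins chosen for illustration; the actual tools code is C and uses pthread_once as the excerpt states.

```python
import threading


class Once:
    """Run an initializer at most once, even with concurrent callers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def call(self, init):
        if self._done:               # fast path after initialization
            return
        with self._lock:
            if not self._done:       # re-check under the lock
                init()
                self._done = True


init_calls = []
mount_point = {}
hugetlbfs_once = Once()


def init_hugetlbfs_path():
    # Hypothetical stand-in for locating the mount point.
    init_calls.append(1)
    mount_point["hugetlbfs"] = "/dev/hugepages"


def hugetlbfs_mountpoint():
    hugetlbfs_once.call(init_hugetlbfs_path)
    return mount_point["hugetlbfs"]


threads = [threading.Thread(target=hugetlbfs_mountpoint) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(init_calls))   # the initializer ran exactly once
```

Because only one thread ever runs the initializer, the path is allocated once and the leak from racing initializations cannot occur.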

v4: bpf-next: verify scalar ids mapping in regsafe()

This example is unsafe because not all execution paths verify r7 range. Because of the jump at (4) the verifier would arrive at (6) in two states: I. r6{.id=b}, r7{.id=b} via path 1-6; II. r6{.id=a}, r7{.id=b} via path 1-4, 6.

Currently regsafe() does not call check_ids() for scalar registers, thus from POV of regsafe() states (I) and (II) are identical.
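Why the two states must not be treated as identical can be illustrated with a simplified user-space model of an id-mapping check. This is a sketch, not the kernel's regsafe() or check_ids(): all `demo_` names are hypothetical, ids are assumed to be non-zero, and only r6 and r7 are compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_ID_MAP_SIZE 8

/* One remembered pairing between an id in the old state and the current one. */
struct demo_id_pair {
	unsigned int old;
	unsigned int cur;
};

/*
 * Simplified id-mapping check: every id seen in the old state must map to
 * exactly one id in the current state. A zero 'old' marks an unused slot.
 */
static bool demo_check_ids(unsigned int old_id, unsigned int cur_id,
			   struct demo_id_pair *idmap)
{
	int i;

	for (i = 0; i < DEMO_ID_MAP_SIZE; i++) {
		if (idmap[i].old == 0) {
			/* First time old_id is seen: remember its partner. */
			idmap[i].old = old_id;
			idmap[i].cur = cur_id;
			return true;
		}
		if (idmap[i].old == old_id)
			return idmap[i].cur == cur_id;
	}
	return false;	/* map is full: be conservative */
}

/* Compare the scalar ids of {r6, r7} in a cached (old) and a current state. */
static bool demo_states_equal(const unsigned int old_ids[2],
			      const unsigned int cur_ids[2])
{
	struct demo_id_pair idmap[DEMO_ID_MAP_SIZE];
	int i;

	memset(idmap, 0, sizeof(idmap));
	for (i = 0; i < 2; i++) {
		if (!demo_check_ids(old_ids[i], cur_ids[i], idmap))
			return false;
	}
	return true;
}
```

With state (I) as the cached state and state (II) as the current one, r6 maps b to a and r7 then tries to map b to b, so the model rejects the pair instead of pruning the current path.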

v3: net-next: introduce page_pool_alloc() API

In [1] & [2], there are use cases for veth and virtio_net to use frag support in page pool to reduce memory usage, and they may request different frag sizes depending on the head/tail room space for xdp_frame/shinfo and the mtu/packet size. When the requested frag size is so large that a single page cannot be split into more than one frag, using frag support only incurs a performance penalty because of the extra frag count handling.

v4: bpf-next: bpf, x86: allow function arguments up to 12 for TRACING

Therefore, let’s enhance it by increasing the function arguments count allowed in arch_prepare_bpf_trampoline(), for now, only x86_64.

In the 1st patch, we make arch_prepare_bpf_trampoline() support copying function arguments on the stack for the x86 arch. Therefore, the maximum number of arguments can be up to MAX_BPF_FUNC_ARGS for FENTRY and FEXIT.

v3: Bring back vmlinux.h generation

Commit 760ebc45746b (“perf lock contention: Add empty ‘struct rq’ to satisfy libbpf ‘runqueue’ type verification”) inadvertently created a declaration of ‘struct rq’ that conflicted with the one in a generated vmlinux.h.

v5: bpf-next: selftests/bpf: Add benchmark for bpf memory allocator

The benchmark could be used to compare the performance of hash map operations and the memory usage between different flavors of bpf memory allocator (e.g., no bpf ma vs bpf ma vs reuse-after-gp bpf ma). It also could be used to check the performance improvement or the memory saving provided by optimization.

v1: ftrace: Show all functions with addresses in available_filter_functions_addrs

When using ftrace-based tracers, we need to cross-check available_filter_functions with /proc/kallsyms. For example, for a kprobe_multi bpf link (based on fprobe) we need to make sure that the symbol regex resolves to traceable symbols and that we get proper addresses for them.
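A consumer of the new file could parse each entry along these lines. This is a minimal sketch that assumes every line has the form "<hex address> <symbol name>"; the struct and function names are hypothetical, and any trailing text on a line is ignored.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one parsed entry. */
struct demo_func_addr {
	unsigned long long addr;
	char name[128];
};

/*
 * Parse one line, assumed to look like "<hex address> <symbol>".
 * Returns 0 on success and -1 if the line does not match that shape.
 */
static int demo_parse_line(const char *line, struct demo_func_addr *out)
{
	if (sscanf(line, "%llx %127s", &out->addr, out->name) != 2)
		return -1;
	return 0;
}
```

Having the address next to the name avoids a second pass over /proc/kallsyms just to resolve each traceable symbol.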

v5: bpf: Socket lookup BPF API from tc/xdp ingress does not respect VRF bindings.

When calling socket lookup from L2 (tc, xdp), VRF boundaries aren’t respected. This patchset fixes this by regarding the incoming device’s VRF attachment when performing the socket lookups from tc/xdp.

v2: bpf-next: bpf: Support ->fill_link_info for kprobe_multi and perf_event links

This patchset enhances the usability of kprobe_multi programs by introducing support for ->fill_link_info. This allows users to easily determine the probed functions associated with a kprobe_multi program. While bpftool perf show already provides information about functions probed by perf_event programs, supporting ->fill_link_info ensures consistent access to this information across all bpf links.

v1: perf lock contention: Add -x option for CSV style output

Sometimes we want to process the output by external programs. Let’s add the -x option to specify the field separator like perf stat.

v2: bpf-next: BPF token

This patch set introduces a new BPF object, the BPF token, which allows delegating a subset of BPF functionality from a privileged system-wide daemon (e.g., systemd or any other container manager) to a trusted unprivileged application. Trust is the key here. This functionality is not about allowing unconditional unprivileged BPF usage. Establishing trust, though, is completely up to the discretion of the respective privileged application that would create a BPF token.

v1: bpf-next: selftests/bpf: Add missing prototypes for several test kfuncs

Adding missing prototypes for several kfuncs that are used by test_verifier tests. We don’t really need kfunc prototypes for these tests, but adding them to silence ‘make W=1’ build and to have all test kfuncs declarations in bpf_testmod_kfunc.h.

v2: bpf-next: BPF link support for tc BPF programs

This series adds BPF link support for tc BPF programs. We initially presented the motivation, related work and design at last year’s LPC conference in the networking & BPF track [0], and a recent update on our progress of the rework during this year’s LSF/MM/BPF summit [1]. The main changes are in the first two patches, and the last two contain an extensive batch of test cases we developed along with it; please see the individual patches for details. We tested this series with the tc-testing selftest suite as well as BPF CI/selftests. Thanks!

v2: bpf-next: bpf, arm64: use BPF prog pack allocator in BPF JIT

BPF programs currently consume a page each on ARM64. For systems with many BPF programs, this adds significant pressure on the instruction TLB. High iTLB pressure usually causes a slowdown for the whole system.

v4: bpf-next: Handle immediate reuse in bpf memory allocator

The implementation of v4 is mainly based on suggestions from Alexei [0]. There are still pending problems with the current implementation, as shown in the benchmark result in patch #3, but it has been a long time since v3 was posted, so v4 is posted here for further discussion and more suggestions.

v1: bpf: search_bpf_extables should search subprogram extables

JIT’d bpf programs that have subprograms can have a positive value for num_extentries but a NULL value for extable. This is problematic if one of these bpf programs encounters a fault during its execution. The fault handlers correctly identify that the faulting IP belongs to a bpf program. However, performing a search_extable call on a NULL extable leads to a second fault.
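The failure mode and the direction of the fix can be modelled in user space. This is a simplified sketch under stated assumptions, not the kernel code: the `demo_` types are hypothetical, the table is scanned linearly, and each (sub)program is assumed to own an address range and an optional exception table.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical exception table entry: faulting insn address and its fixup. */
struct demo_extable_entry {
	unsigned long insn;
	unsigned long fixup;
};

/* Hypothetical (sub)program: code range [start, end) plus optional table. */
struct demo_prog {
	unsigned long start;
	unsigned long end;
	const struct demo_extable_entry *extable;	/* may be NULL */
	size_t num_exentries;
};

/* Search one program's table, tolerating a NULL table pointer. */
static const struct demo_extable_entry *
demo_search_extable(const struct demo_prog *prog, unsigned long ip)
{
	size_t i;

	if (!prog->extable)
		return NULL;	/* never dereference a missing table */
	for (i = 0; i < prog->num_exentries; i++) {
		if (prog->extable[i].insn == ip)
			return &prog->extable[i];
	}
	return NULL;
}

/* Find the (sub)program whose range covers ip, then search its own table. */
static const struct demo_extable_entry *
demo_search_progs(const struct demo_prog *progs, size_t nprogs,
		  unsigned long ip)
{
	size_t i;

	for (i = 0; i < nprogs; i++) {
		if (ip >= progs[i].start && ip < progs[i].end)
			return demo_search_extable(&progs[i], ip);
	}
	return NULL;
}
```

In this model a fault inside a subprogram is resolved against that subprogram's table, and a program with a count but no table simply yields no match instead of faulting again.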

v3: bpf-next: xsk: multi-buffer support

This series of patches adds multi-buffer support for AF_XDP. XDP and various NIC drivers already have support for multi-buffer packets. With this patch set, programs using AF_XDP sockets can now also receive and transmit multi-buffer packets in both copy and zero-copy mode. The ZC multi-buffer implementation is based on the ice driver.

v2: bpf: netfilter: add BPF_NETFILTER bpf_attach_type

Andrii Nakryiko writes:

And we currently don’t have an attach type for NETLINK BPF link. Thankfully it’s not too late to add it. I see that link_create() in kernel/bpf/syscall.c just bypasses attach_type check. We shouldn’t have done that. Instead we need to add BPF_NETLINK attach type to enum bpf_attach_type. And wire all that properly throughout the kernel and libbpf itself.

v1: Add api to manipulate global variable

We (the antgroup) have a requirement to manipulate global variables. The platform that manages bpf bytecode has no idea about the variables’ type/size/address. It only has some strings (like key = value) passed from the admin. We found a way to parse BTF and then query/update the variables. There may be better ways to do it. This approach is what we can find for now.

v1: bpf: Add extra path pointer check to d_path helper

Anastasios reported a crash on the stable 5.15 kernel with the following bpf program attached to an lsm hook:

      SEC("lsm.s/bprm_creds_for_exec")
      int BPF_PROG(bprm_creds_for_exec, struct linux_binprm *bprm)
      {
            struct path *path = &bprm->executable->f_path;
            char p[128] = { 0 };

            bpf_d_path(path, p, 128);
            return 0;
      }

However, bprm->executable can be NULL, in which case the bpf_d_path call crashes.

Related Technology Updates

Qemu

v3: linux-user/riscv: Add syscall riscv_hwprobe

This patch adds the new syscall for the “RISC-V Hardware Probing Interface” (https://docs.kernel.org/riscv/hwprobe.html).

v4: target/riscv: Add Smrnmi support.

This patchset added support for Smrnmi Extension in RISC-V.

RNMI also has higher priority than any other interrupts or exceptions and cannot be disabled by software.

RNMI may be used to route to other devices such as Bus Error Unit or Watchdog Timer in the future.

Buildroot

[branch/2023.02.x] package/cmake: (ctest) add support for riscv architecture

commit: https://git.buildroot.net/buildroot/commit/?id=13e4f1942cb2aca57edf5b2b7514d491690e8eeb
branch: https://git.buildroot.net/buildroot/commit/?id=refs/heads/2023.02.x

Package binaries can be successfully built for and then executed on RISC-V platforms including RV32 and RV64 variants. Tested in QEMU.

U-Boot

v4: SPL NVMe support

This patchset adds support for loading images of the SPL’s next booting stage from an NVMe device.


