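RISC-V Linux Kernel and Related Technology News, Issue 77
Date: 20240204
Editor: 晓怡
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS
Kernel News
RISC-V Architecture Support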
v8: KVM: selftests: Add SEV smoke test
Add a basic SEV smoke test. Unlike the intra-host migration tests, this one actually runs a small chunk of code in the guest.
v3: riscv: Use CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS to set misaligned access speed
If CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is enabled, no time needs to be spent in the misaligned access speed probe. Disable the probe in this case and set respective uses to “fast” misaligned accesses. On riscv, this config is selected if RISCV_EFFICIENT_UNALIGNED_ACCESS is selected, which is dependent on NONPORTABLE.
v1: riscv: Only flush the mm icache when setting an exec pte
We used to emit a flush_icache_all() whenever a dirty executable mapping is set in the page table but we can instead call flush_icache_mm() which will only send IPIs to cores that currently run this mm and add a deferred icache flush to the others.
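The difference between the two strategies can be pictured with a toy model. The sketch below is not kernel code: `plan_icache_flush` and `plan_flush_all` are hypothetical helper names, and CPUs are modelled as plain integers. It only shows which cores would be interrupted immediately and which would be flushed lazily.

```python
def plan_flush_all(online_cpus):
    """Old behaviour (modelled): every online CPU is interrupted right away."""
    return sorted(online_cpus), []


def plan_icache_flush(online_cpus, cpus_running_mm):
    """New behaviour (modelled): IPI only the CPUs currently running this mm.

    Returns (ipi_now, deferred). CPUs in `deferred` would flush their
    icache later, before they next run code from this mm.
    """
    running = set(cpus_running_mm) & set(online_cpus)
    ipi_now = sorted(running)
    deferred = sorted(set(online_cpus) - running)
    return ipi_now, deferred


if __name__ == "__main__":
    # Hypothetical 8-core system where the mm currently runs on CPUs 1 and 5.
    print(plan_flush_all(range(8)))
    print(plan_icache_flush(range(8), {1, 5}))
```

In this model the number of immediate IPIs drops from the number of online CPUs to the number of CPUs that actually run the address space.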
v9: riscv: sophgo: add clock support for sg2042
This series adds clock controller support for sophgo sg2042.
v2: riscv: add CALLER_ADDRx support
CALLER_ADDRx returns the caller’s address at the specified level; these macros are used by several tracers. They eventually use __builtin_return_address(n) to get the caller’s address if the architecture doesn’t define its own implementation.
v1: riscv: hwprobe: export VA_BITS
Some userspace applications (OpenJDK, for instance) use the free bits in pointers to store additional information for their own logic. Currently they rely on parsing /proc/cpuinfo to obtain the number of virtual address bits in use [1]. Exporting VA_BITS through hwprobe will give them a more stable interface.
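To see why an application needs to know VA_BITS, consider a minimal sketch of pointer tagging. It models 64-bit pointers as Python integers and assumes VA_BITS = 39 as an example value (it is not read from hwprobe) and that user-space addresses fit in the low VA_BITS bits. The helper names are hypothetical and do not reflect how OpenJDK implements this.

```python
POINTER_BITS = 64
VA_BITS = 39  # assumed example value; a real program would query the kernel

ADDR_MASK = (1 << VA_BITS) - 1
TAG_BITS = POINTER_BITS - VA_BITS


def tag_pointer(addr, tag):
    """Store `tag` in the bits above VA_BITS that the address does not use."""
    if addr & ~ADDR_MASK:
        raise ValueError("address does not fit in VA_BITS")
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in the free bits")
    return (tag << VA_BITS) | addr


def untag_pointer(tagged):
    """Recover the original address; needed before any dereference."""
    return tagged & ADDR_MASK


def pointer_tag(tagged):
    """Extract the metadata stored in the free upper bits."""
    return tagged >> VA_BITS


if __name__ == "__main__":
    p = tag_pointer(0x3FFFFFF000, 5)
    print(hex(p), hex(untag_pointer(p)), pointer_tag(p))
```

If the application guesses VA_BITS too low, the tag overlaps real address bits, which is why a stable way to query the value matters.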
Process Scheduling
v1: sched: Add trace events for Proxy Execution (PE)
Add sched_[start, finish]_task_selection trace events to measure the latency of PE patches in task selection.
Moreover, introduce trace events for interesting events in PE:
- sched_pe_enqueue_sleeping_task: a task gets enqueued on wait queue of a sleeping task (mutex owner).
- sched_pe_cross_remote_cpu: dependency chain crosses remote CPU.
- sched_pe_task_is_migrating: mutex owner task migrates.
v2: sched/fair: Defer CFS throttle to user entry
CFS tasks can end up throttled while holding locks that other, non-throttled tasks are blocking on.
For !PREEMPT_RT, this can be a source of latency due to the throttling causing a resource acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
- A CFS task p0 gets throttled while holding read_lock(&lock)
- A task p1 blocks on write_lock(&lock), making further readers enter the slowpath
- A ktimers or ksoftirqd task blocks on read_lock(&lock)
v5: net/sched: Load modules via alias
These modules may be loaded lazily without user’s awareness and control. Add respective aliases to modules and request them under these aliases so that modprobe’s blacklisting mechanism (through aliases) works for them. (The same pattern exists e.g. for filesystem modules.)
For example (before the change):
$ tc filter add dev lo parent 10: protocol ip prio 10 handle 1: cgroup
# cls_cgroup module is loaded despite a `blacklist cls_cgroup` entry
# in /etc/modprobe.d/*.conf
Memory Management
v1: lib/bch.c: increase bitrev single conversion length
Optimizes the performance of three functions (load_ecc8, store_ecc8 and bch_encode) by using a larger calculation length.
v1: mm/zswap: invalidate old entry when store fail or !zswap_enabled
zswap_store() may encounter a duplicate entry in three cases:
- A swap slot freed to the per-cpu swap cache does not invalidate its zswap entry and is then reused. This has been fixed.
- In !exclusive load mode, a swapped-in folio leaves its zswap entry on the tree and is then swapped out again. This mode has been removed.
- A folio can be dirtied again after zswap_store(), so zswap_store() must be called again. This should be handled correctly.
v1: meminfo: provide estimated per-node’s available memory
The system offers an estimate of the per-node’s available memory, in addition to the system’s available memory provided by /proc/meminfo.
As with commit 34e431b0ae39 (“/proc/meminfo: provide estimated available memory”), it is more convenient to provide such an estimate in /sys/bus/node/devices/nodex/meminfo. If things change in the future, we only have to change it in one place.
v5: -next: minor improvements for x86 mce processing
In this patchset, we remove the unused macro EX_TYPE_COPY and centralize the processing of memory-failure to do_machine_check() to avoid calling memory_failure_queue() separately for different MC-Safe scenarios. In addition, MCE_IN_KERNEL_COPYIN is renamed MCE_IN_KERNEL_COPY_MC to expand its usage scope.
v11: ACPI: APEI: handle synchronous exceptions in task work to send correct SIGBUS si_code
changes since v5 by addressing comments from Kefeng:
- document return value of memory_failure()
- drop redundant comments in call site of memory_failure()
- make ghes_do_proc void and handle abnormal case within it
- pick up reviewed-by tag from Kefeng Wang
v8: arm64/gcs: Provide support for GCS in userspace
The arm64 Guarded Control Stack (GCS) feature provides support for hardware protected stacks of return addresses, intended to provide hardening against return oriented programming (ROP) attacks and to make it easier to gather call stacks for applications such as profiling.
v3: mm/mmap: pass vma to vma_merge()
These vma_merge() callers pass mm, anon_vma and file, all of which come from the same vma, so there is no need to pass three separate parameters.
Pass vma instead of mm, anon_vma and file to vma_merge(), which saves two parameters.
v3: mm: memcg: Use larger batches for proactive reclaim
Before 0388536ac291 (“mm:vmscan: fix inaccurate reclaim during proactive reclaim”) we passed the number of pages for the reclaim request directly to try_to_free_mem_cgroup_pages, which could lead to significant overreclaim. After 0388536ac291 the number of pages was limited to a maximum of 32 (SWAP_CLUSTER_MAX) to reduce the amount of overreclaim. However, such a small batch size caused a regression in reclaim performance due to many more reclaim start/stop cycles inside memory_reclaim.
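The cost of a small fixed batch can be illustrated with a rough model. The sketch below counts reclaim start/stop cycles under the simplifying assumption that every call frees exactly the batch it was asked for. `proportional_batch` is a hypothetical policy chosen for illustration (a quarter of the remaining request, never below SWAP_CLUSTER_MAX); it should not be read as the exact formula used by the patch.

```python
SWAP_CLUSTER_MAX = 32  # batch limit mentioned in the text above


def fixed_batch(remaining):
    """Always ask for SWAP_CLUSTER_MAX pages per call."""
    return SWAP_CLUSTER_MAX


def proportional_batch(remaining):
    """Hypothetical policy: a quarter of what is left, at least SWAP_CLUSTER_MAX."""
    return max(remaining // 4, SWAP_CLUSTER_MAX)


def reclaim_cycles(nr_to_reclaim, batch_for):
    """Count reclaim calls, assuming each call frees exactly its batch."""
    reclaimed = 0
    cycles = 0
    while reclaimed < nr_to_reclaim:
        remaining = nr_to_reclaim - reclaimed
        batch = min(remaining, batch_for(remaining))
        reclaimed += batch
        cycles += 1
    return cycles


if __name__ == "__main__":
    pages = 262144  # 1 GiB worth of 4 KiB pages, an arbitrary example request
    print("fixed:", reclaim_cycles(pages, fixed_batch))
    print("proportional:", reclaim_cycles(pages, proportional_batch))
```

Because the proportional batch shrinks as the request nears completion, the model keeps the overshoot on the final calls small while needing far fewer cycles than the fixed batch of 32.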
v1: mm: Reduce dependencies on <linux/kernel.h>
“page_counter.h” does not need <linux/kernel.h>. <linux/limits.h> is enough to get LONG_MAX.
Files that include page_counter.h are limited. They have been compile tested or checked.
v2: iommu/iova: use named kmem_cache for iova magazines
The magazine buffers can take gigabytes of kmem memory, dominating all other allocations. For observability purpose create named slab cache so the iova magazine memory overhead can be clearly observed.
v5: mm/mempolicy: weighted interleave mempolicy and sysfs extension
(v5: style, retry interleave w/ mems_allowed cookie, fix sparse warnings, style, review tags)
v3: Enable >0 order folio memory compaction
This patchset enables >0 order folio memory compaction, which is one of the prerequisites for large folio support[1]. It includes the fix[4] for V2 and is on top of mm-everything-2024-01-29-07-19.
v2: Handle delay slot for extable lookup
This series fixed extable handling for architecture delay slot (MIPS).
Please see previous discussions at [1].
There are some other places in kernel not handling delay slots properly, such as uprobe and kgdb, I’ll sort them later.
Test that KASan can detect some unsafe atomic accesses.
As discussed in the linked thread below, these tests attempt to cover the most common uses of atomics and, therefore, aren’t exhaustive.
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=214055
v5: Transparent Contiguous PTEs for User Mappings
This is a series to opportunistically and transparently use contpte mappings (set the contiguous bit in ptes) for user memory when those mappings meet the requirements. The change benefits arm64, but there is some minor refactoring for x86 and powerpc to enable its integration with core-mm.
v1: regset: use vmalloc() for regset_get_alloc()
An order 7 allocation is (1 << 7) contiguous pages, or 512K. It’s not a surprise that this allocation failed on a system that’s been running for a while.
In this case we’re just generating a core dump and there’s no reason we need contiguous memory. Change the allocation to vmalloc(). We’ll change the free in binfmt_elf to kvfree() which works regardless of how the memory was allocated.
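The 512K figure follows directly from the allocation order. A minimal sketch of the arithmetic, assuming 4 KiB base pages (the page size is not stated in the text and differs on some configurations):

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def order_to_pages(order):
    """An order-n allocation covers 2**n physically contiguous pages."""
    return 1 << order


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Total size in bytes of an order-n allocation."""
    return order_to_pages(order) * page_size


if __name__ == "__main__":
    size = order_to_bytes(7)
    print(order_to_pages(7), "pages =", size // 1024, "KiB")
```

Finding 128 physically contiguous free pages becomes harder as memory fragments, whereas vmalloc() only needs the pages to be virtually contiguous.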
Add mempool_init_kvmalloc_pool() and mempool_create_kvmalloc_pool(), which wrap kvmalloc() instead of kmalloc() - kmalloc() with a vmalloc() fallback.
File Systems
v2: Restore data lifetime support
UFS devices are widely used in mobile applications, e.g. in smartphones. UFS vendors need data lifetime information to achieve good performance. Providing data lifetime information to UFS devices can result in up to 40% lower write amplification. Hence this patch series that restores the bi_write_hint member in struct bio. After this patch series has been merged, patches that implement data lifetime support in the SCSI disk (sd) driver will be sent to the Linux kernel SCSI maintainer.
v9: io_uring: add support for ftruncate
This patch adds support for doing truncate through io_uring, eliminating the need for applications to roll their own thread pool or offload mechanism to be able to do non-blocking truncates.
v1: Decomplicate file_dentry()
Miklos,
When posting the patches for file_user_path(), I wrote [1]:
“This change already makes file_dentry() moot, but for now we did not change this helper just added a WARN_ON() in ovl_d_real() to catch if we have made any wrong assumptions.
v1: remap_range: merge do_clone_file_range() into vfs_clone_file_range()
commit dfad37051ade (“remap_range: move permission hooks out of do_clone_file_range()”) moved the permission hooks from do_clone_file_range() out to its caller vfs_clone_file_range(), but left all the fast sanity checks in do_clone_file_range().
v1: fs/address_space: move i_mmap_rwsem to mitigate a false sharing with i_mmap.
In struct address_space, there is a 32-byte gap between i_mmap and i_mmap_rwsem. Because struct address_space variables are only aligned to 8 bytes, in certain situations i_mmap and i_mmap_rwsem may end up in the same cache line.
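Whether two fields share a cache line depends on where the structure happens to start. The sketch below assumes 64-byte cache lines and, purely for illustration, treats the two fields as starting 32 bytes apart at hypothetical offsets 0 and 32. The real offsets inside struct address_space are different, and only the first byte of each field is considered.

```python
CACHE_LINE = 64  # assumption: 64-byte cache lines
STRUCT_ALIGN = 8  # alignment stated in the text above

# Hypothetical field offsets, chosen only to model a 32-byte distance.
OFF_I_MMAP = 0
OFF_I_MMAP_RWSEM = 32


def same_cache_line(base, off_a, off_b, line=CACHE_LINE):
    """True if the first bytes of both fields fall in the same cache line."""
    return (base + off_a) // line == (base + off_b) // line


def sharing_bases(off_a, off_b, align=STRUCT_ALIGN, line=CACHE_LINE):
    """List the base offsets within a cache line where both fields share it."""
    return [b for b in range(0, line, align) if same_cache_line(b, off_a, off_b)]


if __name__ == "__main__":
    print(sharing_bases(OFF_I_MMAP, OFF_I_MMAP_RWSEM))
```

Under these assumptions, half of the possible 8-byte-aligned placements put both fields in one line, which is the kind of placement-dependent false sharing the patch tries to avoid by moving the field.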
v1: __fs_parse: Correct a documentation comment
Commit 7f5d38141e30 (“new primitive: __fs_parse()”) introduced __fs_parse(), which takes p_log instead of fs_context.
So, update that comment to refer to p_log instead.
This patchset removes uses of struct page from the I/O paths of JFS. write_begin and write_end are still passed a struct page, but they convert to a folio as their first thing. The logmgr still uses a struct page, but I think that’s one we actually don’t want to convert since it’s never inserted into the page cache.
v1: blk: optimization for classic polling
This removes the dependency on interrupts to wake up the task. While polling for IO completion, set the task state to TASK_RUNNING if need_resched() returns true. Previously, the polling task would sleep and rely on an interrupt to wake it up, which made some IO take very long when interrupt coalescing is enabled in NVMe.