TinyLab (泰晓科技) -- Focused on Linux: tracing things to their source and seeing the big picture in the details!
Website: https://tinylab.org

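RISC-V Linux Kernel and Related Technology News, Issue 45

Written by 呀呀呀 on 2023/05/07

Date: 20230507
Editor: 晓依
Repository: RISC-V Linux Kernel Technology Research Activity
Sponsor: PLCT Lab, ISCAS

Kernel News

RISC-V Architecture Support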

v3: Allwinner R329/D1/R528/T113s SPI support

This series is an attempt to revive previous work to add support for the SPI controller used in the newest Allwinner SoCs (R329/D1/R528/T113s): https://lore.kernel.org/lkml/BYAPR20MB2472E8B10BFEF75E7950BBC0BCF79@BYAPR20MB2472.namprd20.prod.outlook.com/

v1: riscv: mm: use bitmap_zero() API

bitmap_zero() is faster than bitmap_clear(), so use bitmap_zero() instead of bitmap_clear().
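To make the difference between the two helpers concrete, here is a simplified user-space sketch. It is not the kernel implementation: the `sketch_` names are hypothetical, and the range clear below works bit by bit purely for readability. The point is only that zeroing a whole bitmap can be a single memset() over its backing words, while clearing a range has to deal with a start offset and partial words.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel's bitmap helpers. */
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_BITS_TO_LONGS(n) \
    (((n) + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

/* Zero the whole bitmap with one memset over its backing words. */
static void sketch_bitmap_zero(unsigned long *map, size_t nbits)
{
    assert(map != NULL);
    memset(map, 0, SKETCH_BITS_TO_LONGS(nbits) * sizeof(unsigned long));
}

/* Clear nbits bits starting at bit 'start', one bit at a time (simplified). */
static void sketch_bitmap_clear(unsigned long *map, size_t start, size_t nbits)
{
    assert(map != NULL);
    for (size_t i = start; i < start + nbits; i++)
        map[i / SKETCH_BITS_PER_LONG] &= ~(1UL << (i % SKETCH_BITS_PER_LONG));
}
```

When the range is the whole bitmap, both calls leave the same contents, which is why the substitution is possible in that case.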

v1: RISC-V: KVM: use bitmap_zero() API

bitmap_zero() is faster than bitmap_clear(), so use bitmap_zero() instead of bitmap_clear().

v3: Add TDM audio on StarFive JH7110

This patchset adds TDM audio driver for the StarFive JH7110 SoC. The first patch adds device tree binding for TDM module. The second patch adds tdm driver support for JH7110 SoC. The last patch adds device node of tdm and sound card to JH7110 dts.

The series has been tested on the VisionFive 2 board by plugging an audio expansion board.

For more information about the audio expansion board, see the following webpage: https://wiki.seeedstudio.com/ReSpeaker_2_Mics_Pi_HAT/

v1: perf build: Add system include paths to BPF builds

There are insufficient headers in tools/include to satisfy building BPF programs and their header dependencies. Add the system include paths from the non-BPF clang compile so that these headers can be found.

This code was taken from: tools/testing/selftests/bpf/Makefile

GIT PULL: RISC-V Patches for the 6.4 Merge Window, Part 2

RISC-V Patches for the 6.4 Merge Window, Part 2

  • Support for hibernation.
  • .rela.dyn has been moved to init.
  • A fix for the SBI probing to allow for implementation-defined behavior.
  • Various other fixes and cleanups throughout the tree.

There are still a few minor build issues with drivers, but patches are on the lists. Aside from that things look good with a merge from Linus’ master as of last night, I’ve got another test running now but I don’t see anything scary.

v1: riscv: Optimize memset

This patch optimizes memset for data sizes smaller than 16 bytes. Compared with byte-by-byte stores, it achieves a significant performance improvement.
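The summary does not say how the patch avoids byte-by-byte stores. As an illustration only, the sketch below uses one common technique for small sizes: two possibly overlapping wide stores, one at the start and one at the end of the buffer. This is not necessarily what the patch does; `small_memset_sketch` is a hypothetical user-space function, and memcpy() stands in for unaligned stores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fill n (< 16) bytes at dst with byte c using at most two wide stores. */
static void small_memset_sketch(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    /* Replicate the byte into all eight lanes of a 64-bit pattern. */
    uint64_t pat = 0x0101010101010101ULL * (unsigned char)c;

    assert(n < 16);
    if (n >= 8) {                 /* 8..15 bytes: two 8-byte stores */
        memcpy(p, &pat, 8);
        memcpy(p + n - 8, &pat, 8);
    } else if (n >= 4) {          /* 4..7 bytes: two 4-byte stores */
        memcpy(p, &pat, 4);
        memcpy(p + n - 4, &pat, 4);
    } else if (n >= 2) {          /* 2..3 bytes: two 2-byte stores */
        memcpy(p, &pat, 2);
        memcpy(p + n - 2, &pat, 2);
    } else if (n == 1) {
        p[0] = (unsigned char)c;
    }
}
```

Because every byte of the pattern is identical, the overlapping region is simply written twice with the same value, and byte order does not matter.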

v1: riscv: dts: allwinner: d1: Add SPI0 controller node

Some boards from the MangoPi family (MQ/MQ-Dual/MQ-R) may have an optional SPI flash that connects to the SPI0 controller. This controller is already supported by the sun8i-h3-spi driver, so let’s add its DT node.

v2: RISC-V: Detect Ssqosid extension and handle sqoscfg CSR

This RFC series adds initial support for the Ssqosid extension and the sqoscfg CSR as specified in Chapter 2 of the RISC-V Capacity and Bandwidth Controller QoS Register Interface (CBQRI) specification [1].

QoS (Quality of Service) in this context is concerned with shared resources on an SoC such as cache capacity and memory bandwidth. Intel and AMD already have QoS features on x86, and there is an existing user interface in Linux: the resctrl virtual filesystem [2].

The sqoscfg CSR provides a mechanism by which a software workload (e.g. a process or a set of processes) can be associated with a resource control ID (RCID) and a monitoring counter ID (MCID) that accompanies each request made by the hart to shared resources like cache. CBQRI defines operations to configure resource usage limits, in the form of capacity or bandwidth, for an RCID. CBQRI also defines operations to configure counters to track the resource utilization of an MCID.

The CBQRI spec is still in draft state and is undergoing review [3]. It is possible there will be changes to the Ssqosid extension and the CBQRI spec. For example, the CSR address for sqoscfg is not yet finalized.

My goal for this RFC is to determine if the 2nd patch is an acceptable approach to handling sqoscfg when switching tasks. This RFC was tested against a QEMU branch that implements the Ssqosid extension [4]. A test driver [5] was used to set sqoscfg for the current process. This allows __switch_to_sqoscfg() to be tested without resctrl.

This series is based on riscv/for-next at:

b09313dd2e72 (“RISC-V: hwprobe: Explicity check for -1 in vdso init”)

v2: Split ptdesc from struct page

The MM subsystem is trying to shrink struct page. This patchset introduces a memory descriptor for page table tracking - struct ptdesc.

This patchset introduces ptdesc, splits ptdesc from struct page, and converts many callers of page table constructor/destructors to use ptdescs.

Ptdesc is a foundation to further standardize page tables, and eventually allow for dynamic allocation of page tables independent of struct page. However, the use of pages for page table tracking is quite deeply ingrained and varied across architectures, so there is still a lot of work to be done before that can happen.

This is rebased on next-20230428.

v3: riscv: allow case-insensitive ISA string parsing

This patchset allows case-insensitive ISA string parsing, which is needed in the ACPI environment. As the description of the RISC-V Hart Capabilities Table (RHCT) in the UEFI Forum ECR[1] shows, the format of the ISA string is defined in the RISC-V unprivileged specification[2]. However, that specification defines ISA naming strings as case-insensitive, while the current ISA string parser in the kernel accepts only lowercase letters, so the kernel should allow case-insensitive parsing. This reasoning has also been discussed in Conor’s patch[3]. I have also checked that the ISA string parsing in the recent ACPI support patch[4] calls the same riscv_fill_hwcap function that DT uses now.

The original motivation for my patch v1[5] was that some SoC generators, such as rocket-chip, provide a generated DT whose ISA string is illegal under the dt-binding, which can even cause a kernel panic in some cases, as I mentioned in v1[5]. rocket-chip has since been fixed in PR #3333[6]. However, when a specific version of rocket-chip with an illegal ISA string in the DT is used, this patchset still parses uppercase letters in the DT correctly and thus improves compatibility.

In summary, this patch not only works for case-insensitive ISA string parsing to meet the requirements in ECR[1] but also can be a workaround for some specific versions of rocket-chip.
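The idea of case-insensitive parsing can be sketched in user space as follows. This is a simplified illustration, not the kernel's riscv_fill_hwcap logic: `isa_has_ext_sketch` is a hypothetical helper that only looks at the single-letter extensions before the first underscore and ignores multi-letter extensions and version numbers.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Return true if single-letter extension 'ext' appears in the base part
 * of an ISA string such as "rv64imafdc" or "RV64IMAFDC", ignoring case. */
static bool isa_has_ext_sketch(const char *isa, char ext)
{
    assert(isa != NULL);

    /* Accept the "rv" prefix in either case. */
    if (tolower((unsigned char)isa[0]) != 'r' ||
        tolower((unsigned char)isa[1]) != 'v')
        return false;
    isa += 2;

    /* Skip the register width digits ("32", "64", ...). */
    while (isdigit((unsigned char)*isa))
        isa++;

    /* Scan single-letter extensions up to the first '_' or the end. */
    for (; *isa != '\0' && *isa != '_'; isa++) {
        if (tolower((unsigned char)*isa) == tolower((unsigned char)ext))
            return true;
    }
    return false;
}
```

Lowering both sides of every comparison is what lets an uppercase string from ACPI or an old generated DT be handled the same way as the lowercase form.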

Process Scheduling

v2: sched/debug: correct printing for rq->nr_uninterruptible

Commit e6fe3f422be1 (“sched: Make multiple runqueue task counters 32-bit”) changed the type of rq->nr_uninterruptible from “unsigned long” to “unsigned int”, but left the wrong cast in place when printing to /sys/kernel/debug/sched/debug and to the console.

For example, if nr_uninterruptible’s value is fffffff7 with type “unsigned int”, (long)nr_uninterruptible shows 4294967287 while (int)nr_uninterruptible prints -9. Using an int cast therefore fixes the wrong output.
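The two casts from the example can be reproduced in a few lines of user-space C. This is only a sketch with hypothetical helper names; it assumes a 32-bit unsigned int and, for the first result, a 64-bit long as on 64-bit Linux.

```c
#include <assert.h>

static_assert(sizeof(unsigned int) == 4, "sketch assumes a 32-bit unsigned int");

/* Widening to long zero-extends, so the value stays a large positive number. */
static long nr_as_long(unsigned int v)
{
    return (long)v;
}

/* Converting to int reinterprets the same 32 bits as a signed value. */
static int nr_as_int(unsigned int v)
{
    return (int)v;
}
```

With v = 0xfffffff7, nr_as_long() yields 4294967287 on a 64-bit long, while nr_as_int() yields -9, matching the values quoted above.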

v1: sched: core: Simplify init_sched_mm_cid()

Move the definition of the int mm_users variable to the place where it is used.

v2: sched/deadline: cpuset: Rework DEADLINE bandwidth restoration

Qais reported [1] that iterating over all tasks when rebuilding root domains for finding out which ones are DEADLINE and need their bandwidth correctly restored on such root domains can be a costly operation (10+ ms delays on suspend-resume). He proposed we skip rebuilding root domains for certain operations, but that approach seemed arch specific and possibly prone to errors, as paths that ultimately trigger a rebuild might be quite convoluted (thanks Qais for spending time on this!).

This is v2 of an alternative approach (v1 at [3]) to fix the problem.

v1: sched/numa: Disjoint set vma scan improvements

While this has significantly reduced system time overhead, there are corner cases that genuinely need some relaxation, e.g. the concern raised by PeterZ that unfairness among threads belonging to disjoint sets of VMAs can potentially amplify the side effects of VMA regions belonging to some of the tasks being left unscanned.

With this patch I am seeing a good improvement in the numa01_THREAD_ALLOC case, but please note that with [1] there was a drastic decrease in system time when the benchmarks ran, and this patch adds back some of that system time.

v2: sched/topology: add for_each_numa_cpu() macro

for_each_cpu() is widely used in kernel, and it’s beneficial to create a NUMA-aware version of the macro.

Recently added for_each_numa_hop_mask() works, but switching existing codebase to it is not an easy process.

This series adds for_each_numa_cpu(), which is designed to be similar to for_each_cpu(). Converting existing code to be NUMA-aware becomes as simple as adding a hop iterator variable and passing it to the new macro; for_each_numa_cpu() takes care of the rest.

At the moment, we have 2 users of NUMA-aware enumerators. One is Mellanox’s in-tree driver, and another is Intel’s in-review driver:

https://lore.kernel.org/lkml/20230216145455.661709-1-pawel.chmielewski@intel.com/

Both real-life examples follow the same pattern:

    for_each_numa_hop_mask(cpus, prev, node) {
            for_each_cpu_andnot(cpu, cpus, prev) {
                    if (cnt++ == max_num)
                            goto out;
                    do_something(cpu);
            }
            prev = cpus;
    }

With the new macro, it has a more standard look, like this:

    for_each_numa_cpu(cpu, hop, node, cpu_possible_mask) {
            if (cnt++ == max_num)
                    break;
            do_something(cpu);
    }

Straight conversion of an existing for_each_cpu() codebase to a NUMA-aware version with for_each_numa_hop_mask() is difficult because it doesn’t take a user-provided cpu mask, and it eventually ends up as an open-coded double loop. With for_each_numa_cpu() it shouldn’t be a brainteaser. Consider the NUMA-ignorant example:

    cpumask_t cpus = get_mask();
    int cnt = 0, cpu;

    for_each_cpu(cpu, cpus) {
            if (cnt++ == max_num)
                    break;
            do_something(cpu);
    }

Converting it to NUMA-aware version would be as simple as:

    cpumask_t cpus = get_mask();
    int node = get_node();
    int cnt = 0, hop, cpu;

    for_each_numa_cpu(cpu, hop, node, cpus) {
            if (cnt++ == max_num)
                    break;
            do_something(cpu);
    }

The latter is slightly more verbose but avoids open-coding that annoying double loop. Another advantage is that it works with a ‘hop’ parameter that has the clear meaning of NUMA distance, and it doesn’t force people unfamiliar with the enumerator internals to deal with the current and previous masks machinery.
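The visiting order that such a NUMA-aware loop produces can be modelled in user space. The sketch below is not the kernel macro: the topology table, `SKETCH_NR_CPUS`, and `visit_numa_order_sketch` are all hypothetical, and a plain bitmask stands in for a cpumask. It only shows CPUs being visited nearest hop first, with an early stop after max_num CPUs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 8
#define SKETCH_MAX_HOP 2

/* Hypothetical topology: hop distance of each CPU from the starting node. */
static const int sketch_cpu_hops[SKETCH_NR_CPUS] = {0, 0, 1, 1, 2, 2, 1, 0};

/* Record up to max_num CPUs from 'mask' in order of increasing hop
 * distance; returns how many CPUs were written to 'out'. */
static size_t visit_numa_order_sketch(unsigned int mask, size_t max_num, int *out)
{
    size_t cnt = 0;

    assert(out != NULL);
    for (int hop = 0; hop <= SKETCH_MAX_HOP; hop++) {
        for (int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++) {
            if (!(mask & (1u << cpu)) || sketch_cpu_hops[cpu] != hop)
                continue;
            if (cnt == max_num)
                return cnt;
            out[cnt++] = cpu;
        }
    }
    return cnt;
}
```

With all eight CPUs in the mask and max_num = 5, the three hop-0 CPUs (0, 1, 7) are visited first, followed by the first two hop-1 CPUs (2, 3).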

Memory Management

v1: filemap: Handle error return from __filemap_get_folio()

Smatch reports that filemap_fault() was missed in the conversion of __filemap_get_folio() error returns from NULL to ERR_PTR.
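The class of bug being fixed can be shown with a user-space re-creation of the error-pointer convention. Everything below is a sketch: the `sketch_` helpers imitate the idea behind ERR_PTR(), IS_ERR() and PTR_ERR() but are not the kernel definitions, and `sketch_get_folio` is a hypothetical lookup.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer. */
static void *sketch_err_ptr(long err)
{
    assert(err < 0 && err >= -SKETCH_MAX_ERRNO);
    return (void *)(intptr_t)err;
}

/* Error pointers occupy the top SKETCH_MAX_ERRNO addresses. */
static bool sketch_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

static long sketch_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

static int sketch_folio; /* stand-in for a real object */

/* Hypothetical lookup that reports failure as an error pointer, not NULL. */
static void *sketch_get_folio(bool fail)
{
    return fail ? sketch_err_ptr(-ENOENT) : (void *)&sketch_folio;
}

/* A caller written for the new convention. */
static long sketch_lookup_status(bool fail)
{
    void *folio = sketch_get_folio(fail);

    if (sketch_is_err(folio))   /* a plain "if (!folio)" would miss this */
        return sketch_ptr_err(folio);
    return 0;
}
```

A caller that still tests for NULL treats the error pointer as a valid object, which is the kind of missed conversion Smatch flagged.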

v1: mm/gup: add missing gup_must_unshare() check to gup_huge_pgd()

All other instances of gup_huge_pXd() perform the unshare check, so update the PGD-specific function to do so as well.

While checking pgd_write() might seem unusual, this function already performs such a check via pgd_access_permitted() so this is in line with the existing implementation.

v3: memcontrol: support cgroup level OOM protection

Establish a new OOM score algorithm that supports a cgroup-level OOM protection mechanism. When a global/memcg OOM event occurs, all processes in the cgroup are treated as a whole, and the OOM killer selects the process to kill based on the protection quota of the cgroup.

v1: RESEND: Make PCMCIA and QCOM_HIDMA depend on HAS_IOMEM

This was suggested by Niklas when he reviewed the patches related to the s390 part: https://lore.kernel.org/all/d78edb587ecda0aa09ba80446d0f1883e391996d.camel@linux.ibm.com/T/#u

v1 link: https://lore.kernel.org/all/20230216073403.451455-1-bhe@redhat.com/T/#u

This resends v1 with Niklas’ and Arnd’s ack tags added.

v1: mbind.2: Clarify MPOL_MF_MOVE with MPOL_INTERLEAVE policy

There was user confusion about specifying MPOL_MF_MOVE* with MPOL_INTERLEAVE policy [1]. Add clarification.

[1] https://lore.kernel.org/linux-mm/20230501185836.GA85110@monkey/

v1: mm/hugetlb: revert use of page_cache_next_miss()

As reported by Ackerley[1], the use of page_cache_next_miss() in hugetlbfs_fallocate() introduces a bug where a second fallocate() call to the same offset fails with -EEXIST. Revert this change and go back to the previous method of getting the page from the page cache and then dropping the reference on success.

hugetlbfs_pagecache_present() was also refactored to use page_cache_next_miss(), revert the usage there as well.

User visible impacts include hugetlb fallocate incorrectly returning EEXIST if pages are already present in the file. In addition, hugetlb pages will not be included in core dumps if they need to be brought in via GUP. userfaultfd UFFDIO_COPY also uses this code and will not notice pages already present in the cache. It may try to allocate a new page and potentially return ENOMEM as opposed to EEXIST.

v2: shmemfs stable directory cookies

The following series is for continued discussion of the need for and implementation of stable directory cookies for shmemfs/tmpfs.

Based on one of Andrew’s review comments, I’ve split this one patch into a series to (hopefully) reduce its complexity and make it easier to analyze the changes.

Although the patch(es) have been passing functional tests for several weeks, there have been some reports of performance regressions that we still need to get to the bottom of.

We might consider a simpler lseek/readdir implementation, as using an xarray is effective but a bit of overkill. I’d like to avoid a linked list implementation as that is known to have significant performance impact past a dozen or so list entries.

v2: maple_tree: Make maple state reusable after mas_empty_area()

Make mas->min and mas->max point to a node range instead of a leaf entry range. This allows mas to still be usable after mas_empty_area() returns. Without this change, users would get unexpected results from other operations on the maple state after calling the affected function.

v1: sysctl: add config to make randomize_va_space RO

Add config RO_RANDMAP_SYSCTL to set the mode of the randomize_va_space sysctl to 0444 to disallow all runtime changes. This will prevent accidental changing of this value by a root service.

The config is disabled by default to avoid surprises.

v9: mm/gup: disallow GUP writing to file-backed mappings by default

Writing to file-backed mappings which require folio dirty tracking using GUP is a fundamentally broken operation, as kernel write access to GUP mappings does not adhere to the semantics expected by a file system.

A GUP caller uses the direct mapping to access the folio, which does not cause write notify to trigger, nor does it enforce that the caller marks the folio dirty.

The problem arises when, after an initial write to the folio, writeback results in the folio being cleaned and then the caller, via the GUP interface, writes to the folio again.

As a result of the use of this secondary, direct mapping to the folio, no write notify will occur, and if the caller does mark the folio dirty, this will be done unexpectedly.

For example, consider the following scenario:

  1. A folio is written to via GUP which write-faults the memory, notifying the file system and dirtying the folio.
  2. Later, writeback is triggered, resulting in the folio being cleaned and the PTE being marked read-only.
  3. The GUP caller writes to the folio, as it is mapped read/write via the direct mapping.
  4. The GUP caller, now done with the page, unpins it and sets it dirty (though it does not have to).

This change updates both the PUP FOLL_LONGTERM slow and fast APIs. As pin_user_pages_fast_only() does not exist, we can rely on a slightly imperfect whitelisting in the PUP-fast case and fall back to the slow case should this fail.

v1: MDWE without inheritance

Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags current with a flag that prevents pages that were previously not executable from becoming executable.

This tag always gets inherited by children tasks. (it’s in MMF_INIT_MASK)

At Google, we’ve been using a somewhat similar downstream patch for a few years now. To make the adoption of this feature easier, we’ve had it support a mode in which the W^X flag does not propagate to children. For example, this is handy if a C process which wants W^X protection suspects it could start children processes that would use a JIT.

I’d like to align our features with the upstream prctl. This series proposes a new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It sets a different flag in current that is not in MMF_INIT_MASK and which does not propagate.

As part of looking into MDWE, I also fixed a couple of things in the MDWE test.

v1: mm: always respect QUEUE_FLAG_STABLE_WRITES on the block device

Commit 1cb039f3dc16 (“bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag”) introduced a regression for the raw block device use case. Capturing QUEUE_FLAG_STABLE_WRITES flag in set_bdev_super() has the effect of respecting it only when there is a filesystem mounted on top of the block device. If a filesystem is not mounted, block devices that do integrity checking return sporadic checksum errors.

Additionally, this commit made the corresponding sysfs knob writeable for debugging purposes. However, because QUEUE_FLAG_STABLE_WRITES flag is captured when the filesystem is mounted and isn’t consulted after that anywhere outside of swap code, changing it doesn’t take immediate effect even though dumping the knob shows the new value. With no way to dump SB_I_STABLE_WRITES flag, this is needlessly confusing.

Resurrect the original stable writes behavior by changing folio_wait_stable() to account for the case of a raw block device and also:

  • for the case of a filesystem, test QUEUE_FLAG_STABLE_WRITES flag each time instead of capturing it in the superblock so that changes are reflected immediately (thus aligning with the case of a raw block device)
  • retain SB_I_STABLE_WRITES flag for filesystems that need stable writes independent of the underlying block device (currently just NFS)

v1: [For stable 5.4] mm: migrate: buffer_migrate_page_norefs() fallback migrate not uptodate pages

Recently we noticed that the ext4 filesystem occasionally fails to read metadata from disk and reports an error message, although the disk and block layer look fine. After analysis, we tracked this down to commit 88dbcbb3a484 (“blkdev: avoid migration stalls for blkdev pages”). It provides a migration method for the bdev, so a page that has buffers without extra users can now be moved, but it locks the buffers on the page. This breaks many of the fragile metadata read operations in current filesystems, such as ll_rw_block() for common usage and ext4_read_bh_lock() for ext4: these helpers only trylock the buffer and skip submitting IO if locking fails, and many callers just call wait_on_buffer() and conclude there was an IO error if the buffer is not uptodate after it is unlocked.

This issue can be easily reproduced by adding some delay just after buffer_migrate_lock_buffers() in __buffer_migrate_page() and running fsstress on an ext4 filesystem.

EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #73193:comm fsstress: reading directory lblock 0
EXT4-fs error (device pmem1): __ext4_find_entry:1658: inode #75334:comm fsstress: reading directory lblock 0

Something like ll_rw_block() should be used carefully and seems safe only for the readahead case. In the long run, the best fix is to correct the read operations in the filesystems, but for now let us avoid the issue first. This patch avoids it by falling back to migrating pages that are not uptodate in the way fallback_migrate_page() does, because those pages, which have buffers, will probably be read soon.

v3: fs: implement multigrain timestamps

This is a follow-up of the patches I posted last week [1]. The main change in this set is that it no longer uses the lowest-order bit in the tv_nsec field, and instead uses one of the higher-order bits (#31, specifically) since they are otherwise unused. This change makes things much simpler, and we no longer need to twiddle s_time_gran for it.
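Why a high-order bit of tv_nsec is free can be seen from the value range: a valid nanoseconds field is always below 1,000,000,000, which is less than 2^30, so bit 31 is never set by a real timestamp. The sketch below illustrates stashing a flag there. The flag name and helpers are hypothetical and are not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NSEC_PER_SEC UINT32_C(1000000000)
#define SKETCH_MG_FLAG      (UINT32_C(1) << 31)   /* hypothetical flag bit */

/* Mark a nanoseconds value by setting the otherwise unused bit 31. */
static uint32_t sketch_mg_set_flag(uint32_t nsec)
{
    assert(nsec < SKETCH_NSEC_PER_SEC);
    return nsec | SKETCH_MG_FLAG;
}

static bool sketch_mg_flag_is_set(uint32_t raw)
{
    return (raw & SKETCH_MG_FLAG) != 0;
}

/* Recover the real nanoseconds value by masking the flag off. */
static uint32_t sketch_mg_nsec(uint32_t raw)
{
    return raw & ~SKETCH_MG_FLAG;
}
```

Unlike borrowing the lowest-order bit, this leaves every nanosecond value representable, which fits the note above that s_time_gran no longer needs adjusting.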

v13: cachestat: a new syscall for page cache state of files

This series of patches introduces a new system call, cachestat, that summarizes the page cache statistics (number of cached pages, dirty pages, pages marked for writeback, evicted pages, etc.) of a file, in a specified range of bytes. It also includes a selftest suite that tests some typical usage. Currently, the syscall is only wired in for the x86 architecture.

v1: fs: hugetlbfs: Set vma policy only when needed for allocating folio

Calling hugetlb_set_vma_policy() later avoids setting the vma policy and then dropping it on a page cache hit.

v5: bio: check return values of bio_add_page

This series converts the callers of bio_add_page() which can easily use __bio_add_page() to using it and checks the return of bio_add_page() for callers that don’t work on a freshly created bio.

File Systems

v1: bpf-next: Introduce bpf iterators for file-system

The patchset attempts to provide more observability for the file-system as proposed in [0]. Compared to drgn [1], the bpf iterator for file-system has fewer dependencies (e.g., no need for vmlinux) and more accurate results.

GIT PULL: Pipe FMODE_NOWAIT support

Here’s the revised edition of the FMODE_NOWAIT support for pipes, in which we just flag pipes as supporting FMODE_NOWAIT unconditionally, but clear the flag if we ever end up using splice/vmsplice on the pipe. The pipe read/write side is perfectly fine for nonblocking IO; however, splice and vmsplice can potentially wait for IO with the pipe lock held.

v6: Introduce block provisioning primitives

This patch series covers iteration 6 of adding support for block provisioning requests.

v1: fuse: add a new flag to allow shared mmap in FOPEN_DIRECT_IO mode

FOPEN_DIRECT_IO is usually set by the fuse daemon to indicate the need for strong coherency, e.g. for network filesystems. Shared mmap is therefore disabled, since it leverages the page cache and may write to it, which may cause inconsistency. But FOPEN_DIRECT_IO can also be used to reduce memory footprint rather than for coherency, e.g. to reduce guest memory usage with virtiofs. Therefore, add a new flag FOPEN_DIRECT_IO_SHARED_MMAP to allow shared mmap for these cases.

v1: -next: lsm: Change inode_setattr() to take struct

I am working on adding xattr/attr support for landlock [1], so we can control fs accesses such as chmod, chown, utimes, setxattr, etc. inside a landlock sandbox.

v3: dax: enable dax fault handler to report VM_FAULT_HWPOISON

When multiple processes mmap() a dax file and, at some point, one process issues a ‘load’ and consumes a hwpoison, that process receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set for the poison scope. Soon after, when any other process issues a ‘load’ to the poisoned page (which has been unmapped from the kernel side by memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and without a valid si_lsb.

This is confusing to the user and differs from a page fault due to poison in RAM; some helpful information is also lost.

Channel the dax backend driver’s poison detection to the filesystem so that, instead of reporting VM_FAULT_SIGBUS, it can report VM_FAULT_HWPOISON.

v1: Supporting same fsid filesystems mounting on btrfs

Currently, we cannot reliably mount same fsid filesystems even one at a time in btrfs, but if users want to mount them at the same time, it’s pretty much impossible. Other filesystems like ext4 are capable of that.

The goal is to allow systems with an A/B partitioning scheme (like the Steam Deck console or various mobile devices) to hold the same filesystem image in both partitions; it also allows a block-device-level check for filesystem integrity, which is used in the Steam Deck image installation to check whether the current read-only image is pristine. A few more details are provided in the following ML thread:

https://lore.kernel.org/linux-btrfs/c702fe27-8da9-505b-6e27-713edacf723a@igalia.com/

The mechanism used to achieve this is based on the metadata_uuid feature, leveraging its code infrastructure. The patches are based on kernel 6.3 and were tested both in a virtual machine and on the Steam Deck. Comments, suggestions and overall feedback are greatly appreciated - thanks in advance!

GIT PULL: sysctl changes for v6.4-rc4 v2

As mentioned in my first pull request for sysctl-next for v6.4-rc1, we’re very close to being able to deprecate register_sysctl_paths(). I was going to assess the situation after the first week of the merge window.

That time is now and things are looking good. We only have one straggler, a patch which already had an ACK, so I’m picking it up here now, and the last patch is the one that uses an axe. Some careful eyeballing by others would be appreciated. If this doesn’t get properly reviewed I can also just hold off on this in my tree for the next merge window. Either way is fine by me.

I have boot tested the last patch and 0-day build completed successfully.

v1: block atomic writes

This series introduces a new proposal to implementing atomic writes in the kernel.

This series takes the approach of adding a new “atomic” flag to each of pwritev2() and iocb->ki_flags - RWF_ATOMIC and IOCB_ATOMIC, respectively. When set, these indicate that we want the write issued “atomically”. I have seen a similar flag for pwritev2() touted on the lists previously.

Only direct IO is supported, and only for block devices and xfs.

The atomic writes feature requires dedicated HW support, such as the SCSI WRITE_ATOMIC_16 command.

The goal here is to provide an interface that allows applications to use application-specific block sizes larger than the logical block size reported by the storage device or larger than the filesystem block size as reported by stat().

With this new interface, application blocks will never be torn or fractured. On a power failure, for each individual application block, either all or none of the data is written. A racing atomic write and read will mean that the read sees all the old data or all the new data, but never a mix of old and new.

v4: fs: allow to mount beneath top mount

More common use-cases will just be things like:

   mount -t btrfs /dev/sdA /mnt
   mount -t xfs   /dev/sdB --beneath /mnt
   umount /mnt

after which we’ll have updated from a btrfs filesystem to an xfs filesystem without ever revealing the underlying mountpoint.

v24: xfs: online repair for fs summary counters with exclusive fsfreeze

A longstanding deficiency in the online fs summary counter scrubbing code is that it hasn’t any means to quiesce the incore percpu counters while it’s running. There is no way to coordinate with other threads that are reserving or freeing free space simultaneously, which leads to false error reports. Right now, if the discrepancy is large, we just sort of shrug and bail out with an incomplete flag, but this is lame.

For repair activity, we actually /do/ need to stabilize the counters to get an accurate reading and install it in the percpu counter. To improve the former and enable the latter, allow the fscounters online fsck code to perform an exclusive mini-freeze on the filesystem. The exclusivity prevents userspace from thawing while we’re running, and the mini-freeze means that we don’t wait for the log to quiesce, which will make both speedier.

v1: sysctl: death to register_sysctl_paths()

As mentioned in my first pull request for sysctl-next for v6.4-rc1, we’re very close to being able to deprecate register_sysctl_paths(). I was going to assess the situation after the first week of the merge window.

That time is now and things are looking good. We only have one straggler, a patch which already had an ACK, so I’m picking it up here now, and the last patch is the one that uses an axe. Some careful eyeballing by others would be appreciated. If this doesn’t get properly reviewed I can also just hold off on this in my tree for the next merge window. Either way is fine by me.

I have boot tested the last patch and 0-day build is ongoing. You can give it a day for a warm fuzzy build test result.

v1: Rework locking when rendering mountinfo cgroup paths

The idea for these modifications came up when css_set_lock seemed unneeded in cgroup_show_path.

It’s a delicate change, so the deciding factor was that cgroup_show_path also popped up in some profiles of frequent mountinfo readers.

The idea is to trade the exclusive css_set_lock for the shared namespace_sem when rendering cgroup paths. Details are described more in individual commits.

v2: Prepare for supporting more filesystems with fanotify

Following v2 incorporates a few fixes and ACKs from review of v1 [1].

While fanotify relaxes the requirements for filesystems to support reporting fid to require only the ->encode_fh() operation, there are currently no new filesystems that meet the relaxed requirements.

Patches to add ->encode_fh() to overlay with the default configuration are available on my github branch [2]. I will re-post them after this patch set is approved.

Based on the discussion on the UAPI alternatives, I kept the AT_HANDLE_FID UAPI, which seems the simplest of them all.

There is an LTP test [3] that tests reporting fid from overlayfs, which also demonstrates the use of AT_HANDLE_FID for requesting a non-decodeable file handle by userspace and there is a man page draft [4] for the documentation of the AT_HANDLE_FID flags.

v1: FUSE: add another flag to support shared mmap in FOPEN_DIRECT_IO mode

From discussion with Bernd, I understand that FOPEN_DIRECT_IO is designed for use cases where users want strong coherency, such as network filesystems where one server serves multiple remote clients. Shared mmap is therefore disabled, since the existence of a local page cache breaks this kind of coherency.

But our use case here is one virtiofs daemon serving one guest VM. We use FOPEN_DIRECT_IO to reduce the memory footprint, not for coherency, so we expect shared mmap to work in this case. Here I suggest adding (and am implementing) another flag to indicate this kind of case, where FOPEN_DIRECT_IO is used not for coherency, so that shared mmap works.

v1: Memory allocation profiling

Memory allocation profiling infrastructure provides a low overhead mechanism to make all kernel allocations in the system visible. It can be used to monitor memory usage, track memory hotspots, detect memory leaks, identify memory regressions.

To keep the overhead to the minimum, we record only allocation sizes for every allocation in the codebase. With that information, if users are interested in more detailed context for a specific allocation, they can enable in-depth context tracking, which includes capturing the pid, tgid, task name, allocation size, timestamp and call stack for every allocation at the specified code location.

v2: permit write-sealed memfd read-only shared mappings

The man page for fcntl() describing memfd file seals states the following about F_SEAL_WRITE:

    Furthermore, trying to create new shared, writable memory-mappings via
    mmap(2) will also fail with EPERM.

Note the emphasis on writable. It turns out that the kernel currently disallows all new shared memory mappings for a memfd with F_SEAL_WRITE applied, rendering this documentation inaccurate.

Networking Devices

v2: Make iscsid-kernel communications namespace-aware

This set of patches modifies the kernel iSCSI initiator communications so that they are namespace-aware. The goal is to allow multiple iSCSI daemon (iscsid) to run at once as long as they are in separate namespaces, and so that iscsid can run in containers.

Container runtime environments seem to want to containerize their own components, and there have been complaints about the need to run iscsid from the host network namespace. There are still priviledged capabilities needed for iscsid, but these changes address the namespace issue.

I’ve tested with iscsi_tcp and iser over rxe with an unmodified iscsid running in a podman container.

Note that with iscsi_tcp, the connected socket will keep the network namespace alive after container exit. The namespace will exit once the connection terminates, and I’d recommend running with an iSCSI noop_out_timeout set to error out the connection after the routing has been removed.

v1: net-next: net: openvswitch: Use struct_size()

Use struct_size() instead of hand writing it. This is less verbose and more informative.
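The idea behind struct_size() can be illustrated in userspace. The following is a minimal sketch, not the kernel macro: `struct flow_stats`, `flow_stats_size()` and `flow_stats_alloc()` are hypothetical names, and the saturate-to-SIZE_MAX behaviour is modelled on how the kernel helper treats overflow.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical structure ending in a flexible array member. */
struct flow_stats {
    size_t count;
    uint64_t values[];
};

/*
 * Userspace stand-in for struct_size(): header size plus n trailing
 * elements, saturating to SIZE_MAX if the arithmetic would overflow.
 */
static size_t flow_stats_size(size_t n)
{
    size_t base = sizeof(struct flow_stats);
    size_t elem = sizeof(uint64_t);

    if (n > (SIZE_MAX - base) / elem)
        return SIZE_MAX;
    return base + n * elem;
}

/* Allocate a zeroed structure with room for n elements, or NULL. */
static struct flow_stats *flow_stats_alloc(size_t n)
{
    size_t bytes = flow_stats_size(n);
    struct flow_stats *s;

    if (bytes == SIZE_MAX)
        return NULL;
    s = calloc(1, bytes);
    if (s)
        s->count = n;
    return s;
}
```

The hand-written form, `sizeof(*s) + n * sizeof(s->values[0])`, gives the same number for small n but silently wraps for very large n, which is the case a saturating helper is meant to catch.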

v1: can: kvaser_usb_leaf: Implement CAN 2.0 raw DLC functionality.

v7: bpf-next: Introduce a new kfunc of bpf_task_under_cgroup

When tracing sched-related functions such as enqueue_task_fair, it is necessary to check whether a specified task, rather than the current task, is within a given cgroup.

v1: virtio_net: set default mtu to 1500 when ‘Device maximum MTU’ bigger than 1500

When VIRTIO_NET_F_MTU (3), device maximum MTU reporting, is supported and offered by the device, the device advises the driver of its maximum MTU. If negotiated, the driver uses that value as the maximum MTU. But the driver also uses it as the default MTU. Some devices have a maximum MTU greater than 1500, which may cause some large packets to be discarded, so I changed the default MTU to the more general 1500 when ‘Device maximum MTU’ is bigger than 1500.
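The selection rule described here is small enough to sketch directly. This is an illustration only, not the driver code: `default_mtu()` and `DEFAULT_ETH_MTU` are hypothetical names, and the device-reported value is assumed to remain the upper limit that a user may still configure.

```c
#include <stdint.h>

/* Conventional Ethernet payload size used as the default (assumed name). */
#define DEFAULT_ETH_MTU 1500

/*
 * Pick the default MTU from the device-reported maximum: never start
 * above 1500, but keep a smaller device maximum as it is.
 */
static uint16_t default_mtu(uint16_t device_max_mtu)
{
    if (device_max_mtu > DEFAULT_ETH_MTU)
        return DEFAULT_ETH_MTU;
    return device_max_mtu;
}
```

With this rule a device that reports 9000 starts at 1500, while a device that reports 1280 keeps 1280.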

v1: wifi: mwifiex: Use default @max_active for workqueues

These workqueues only host a single work item and thus don’t need explicit concurrency limit. Let’s use the default @max_active. This doesn’t cost anything and clearly expresses that @max_active doesn’t matter.

v1: wifi: iwlwifi: Use default @max_active for trans_pcie->rba.alloc_wq

trans_pcie->rba.alloc_wq only hosts a single work item and thus doesn’t need explicit concurrency limit. Let’s use the default @max_active. This doesn’t cost anything and clearly expresses that @max_active doesn’t matter.

GIT PULL: Networking for v6.4-rc1

Current release - regressions:

  • sched: act_pedit: free pedit keys on bail from offset check

Current release - new code bugs:

  • pds_core:
    • Kconfig fixes (DEBUGFS and AUXILIARY_BUS)
    • fix mutex double unlock in error path

Previous releases - regressions:

  • sched: cls_api: remove block_cb from driver_list before freeing

  • nf_tables: fix ct untracked match breakage

  • eth: mtk_eth_soc: drop generic vlan rx offload

  • sched: flower: fix error handler on replace

Previous releases - always broken:

  • tcp: fix skb_copy_ubufs() vs BIG TCP

  • ipv6: fix skb hash for some RST packets

  • af_packet: don’t send zero-byte data in packet_sendmsg_spkt()

  • rxrpc: timeout handling fixes after moving client call connection to the I/O thread

  • ixgbe: fix panic during XDP_TX with > 64 CPUs

  • igc: RMW the SRRCTL register to prevent losing timestamp config

  • dsa: mt7530: fix corrupt frames using TRGMII on 40 MHz XTAL MT7621

  • r8152:
    • fix flow control issue of RTL8156A
    • fix the poor throughput for 2.5G devices
    • move setting r8153b_rx_agg_chg_indicate() to fix coalescing
    • enable autosuspend
  • ncsi: clear Tx enable mode when handling a Config required AEN

  • octeontx2-pf: macsec: fixes for CN10KB ASIC rev

Misc:

  • 9p: remove INET dependency

v2: net-next: netfilter: nft_set_pipapo: Use struct_size()

Use struct_size() instead of hand writing it. This is less verbose and more informative.

v1: RDMA/mana_ib: Use v2 version of cfg_rx_steer_req to enable RX coalescing

With RX coalescing, one CQE entry can be used to indicate multiple packets on the receive queue. This saves processing time and PCI bandwidth over the CQ.

v1: siw on tunnel devices

Chalk this one up to yet another crazy idea.

At NFS testing events, we’d like to test NFS/RDMA over the event’s private network. We can do that with iWARP using siw from guests.

If the guest itself is on the VPN, that means siw’s slave device is a tun device. Such devices have no MAC address. That breaks the RDMA core’s ability to find the correct egress device for siw when given a source IP address.

We’ve worked around this in the past with various software hacks, but we’d rather see full support for this capability in stock kernels.

A direct and perhaps naive way to do that is to give loopback and tun devices their own artificial MAC addresses for this purpose.

v1: iproute2-next: mptcp: add support for implicit flag

Kernel supports implicit flag since commit d045b9eb95a9 (“mptcp: introduce implicit endpoints”), included in v5.18.

Let’s add support for displaying it to iproute2.

Before this change:

$ ip mptcp endpoint show
10.0.2.2 id 1 rawflags 10

After this change:

$ ip mptcp endpoint show
10.0.2.2 id 1 implicit
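The difference between the two outputs comes down to decoding one more bit of the endpoint flags. The sketch below is not the iproute2 code: `format_flags()` and the `FLAG_*` names are hypothetical, the bit positions are assumed to match the kernel's MPTCP path-manager UAPI (implicit being bit 4), and unknown bits are assumed to be printed in hexadecimal as `rawflags`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Flag bits, assumed to mirror the kernel's MPTCP_PM_ADDR_FLAG_* values. */
#define FLAG_SIGNAL   (1u << 0)
#define FLAG_SUBFLOW  (1u << 1)
#define FLAG_BACKUP   (1u << 2)
#define FLAG_FULLMESH (1u << 3)
#define FLAG_IMPLICIT (1u << 4)

static const struct {
    uint32_t bit;
    const char *name;
} flag_names[] = {
    { FLAG_SIGNAL,   "signal"   },
    { FLAG_SUBFLOW,  "subflow"  },
    { FLAG_BACKUP,   "backup"   },
    { FLAG_FULLMESH, "fullmesh" },
    { FLAG_IMPLICIT, "implicit" },
};

/*
 * Write the names of all known flags set in 'flags' into buf, separated
 * by spaces. Any bits without a name are appended as "rawflags <hex>".
 */
static void format_flags(uint32_t flags, char *buf, size_t len)
{
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(flag_names) / sizeof(flag_names[0]); i++) {
        if (!(flags & flag_names[i].bit))
            continue;
        flags &= ~flag_names[i].bit;
        int n = snprintf(buf + used, len - used, "%s%s",
                         used ? " " : "", flag_names[i].name);
        if (n < 0 || (size_t)n >= len - used)
            return; /* output truncated; stop here */
        used += (size_t)n;
    }
    if (flags)
        snprintf(buf + used, len - used, "%srawflags %x",
                 used ? " " : "", (unsigned int)flags);
}
```

Removing the `FLAG_IMPLICIT` row from the table reproduces the old behaviour, where the value 0x10 falls through to the `rawflags` branch.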

v1: ipsec: af_key: Reject optional tunnel/BEET mode templates in outbound policies

xfrm_state_find() uses encap_family of the current template with the passed local and remote addresses to find a matching state. If an optional tunnel or BEET mode template is skipped in a mixed-family scenario, there could be a mismatch causing an out-of-bounds read as the addresses were not replaced to match the family of the next template.

While there are theoretical use cases for optional templates in outbound policies, the only practical one is to skip IPComp states in inbound policies if uncompressed packets are received that are handled by an implicitly created IPIP state instead.

v1: ipsec: xfrm: Reject optional tunnel/BEET mode templates in outbound policies

xfrm_state_find() uses encap_family of the current template with the passed local and remote addresses to find a matching state. If an optional tunnel or BEET mode template is skipped in a mixed-family scenario, there could be a mismatch causing an out-of-bounds read as the addresses were not replaced to match the family of the next template.

While there are theoretical use cases for optional templates in outbound policies, the only practical one is to skip IPComp states in inbound policies if uncompressed packets are received that are handled by an implicitly created IPIP state instead.

v1: net: socket: Use fdget() and fdput()

By using the fdget function, the socket object can be quickly obtained from the process’s file descriptor table without the need to obtain the file descriptor first before passing it as a parameter to the fget function.

v2: Add motorcomm phy pad-driver-strength-cfg support

The motorcomm phy (YT8531) supports the ability to adjust the drive strength of the rx_clk/rx_data, and the default strength may not be suitable for all boards. So add configurable options to better match the boards (e.g. StarFive VisionFive 2).

The first patch adds a description of the dt-binding, and the second patch adds YT8531’s parsing and settings for pad-driver-strength-cfg.

v6: net-next: TXGBE PHYLINK support

Implement I2C, SFP, GPIO and PHYLINK to setup TXGBE link.

Because our I2C and PCS are based on Synopsys Designware IP-core, extend the i2c-designware and pcs-xpcs driver to realize our functions.

v1: vhost_net: Use fdget() and fdput()

Convert the fget()/fput() uses to fdget()/fdput().

v6: can: usb: f81604: add Fintek F81604 support

This patch adds support for Fintek USB to 2CAN controller.

Security hardening

v1: Hypervisor-Enforced Kernel Integrity

This patch series is a proof-of-concept that implements new KVM features (extended page tracking, MBEC support, CR pinning) and defines a new API to protect guest VMs. No VMM (e.g., Qemu) modification is required.

The main idea is that kernel self-protection mechanisms should be delegated to a more privileged part of the system, hence the hypervisor. It is still the role of the guest kernel to request such restrictions according to its configuration. The high-level security guarantees provided by the hypervisor are semantically the same as a subset of those the kernel already enforces on itself (CR pinning hardening and memory page table protections), but with much higher guarantees.

We’d like the mainline kernel to support such hardening features leveraging virtualization. We’re looking for reviews and comments that can help mainline these two parts: the KVM implementation and the guest kernel API layer designed to support different hypervisors. The struct heki_hypervisor enables to plug in

v1: Compiler Attributes: Add __counted_by macro

In an effort to annotate all flexible array members with their run-time size information, the “element_count” attribute is being introduced by Clang[1] and GCC[2] in future releases. This annotation will provide the CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE features the ability to perform run-time bounds checking on otherwise unknown-size flexible arrays.

Even though the attribute is under development, we can start the annotation process in the kernel. This requires defining a macro for it, even if we have to change the name of the actual attribute later. Since it is likely that this attribute may change its name to “counted_by” in the future (to better align with a future total bytes “sized_by” attribute), name the wrapper macro “__counted_by”, which also reads more clearly (and concisely) in structure definitions.

[1] https://reviews.llvm.org/D148381 [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108896
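A wrapper macro of this kind can be sketched in userspace as follows. This is an assumption-laden illustration, not the kernel header: `COUNTED_BY` stands in for the kernel's `__counted_by` (to avoid a reserved identifier), the attribute is assumed to end up spelled `counted_by` as the cover letter anticipates, and `struct sample_buf` and `sample_buf_alloc()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Expand to the attribute only when the compiler reports support for it;
 * otherwise expand to nothing so the annotation is harmless.
 */
#if defined(__has_attribute)
# if __has_attribute(counted_by)
#  define COUNTED_BY(member) __attribute__((counted_by(member)))
# endif
#endif
#ifndef COUNTED_BY
# define COUNTED_BY(member)
#endif

/* Hypothetical structure: 'count' records how many items[] are valid. */
struct sample_buf {
    size_t count;
    int items[] COUNTED_BY(count);
};

/* Allocate a zeroed buffer for n items and record n in 'count'. */
static struct sample_buf *sample_buf_alloc(size_t n)
{
    struct sample_buf *b;

    if (n > (SIZE_MAX - sizeof(*b)) / sizeof(int))
        return NULL; /* size computation would overflow */
    b = calloc(1, sizeof(*b) + n * sizeof(int));
    if (b)
        b->count = n; /* set the count before touching items[] */
    return b;
}
```

On a compiler without the attribute the structure is laid out identically, which is what allows annotations to be added before compiler support is widely available.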

Asynchronous I/O

v1: io_uring: set plug tags for same file

io_uring tries to optimize allocating tags by hinting to the plug how many it expects to need for a batch instead of allocating each tag individually. But io_uring submission queues may have a mix of many devices for io, so the number of io’s counted may be overestimated. This can lead to allocating too many tags, which adds overhead to finding that many contiguous tags, freeing up the ones we didn’t use, and may starve out other users that can actually use them.

When starting a new batch of uring commands, count only commands that match the file descriptor of the first seen for this optimization.
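The counting rule can be sketched as below. This is a simplified illustration rather than the io_uring code: `struct sub_entry` reduces a queued command to its target file descriptor, and `count_same_file()` is a hypothetical helper that returns the number to use as the tag hint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a queued command: only its target fd. */
struct sub_entry {
    int fd;
};

/*
 * Count how many entries in the batch target the same fd as the first
 * entry. Entries for other files do not contribute to the tag hint.
 */
static size_t count_same_file(const struct sub_entry *batch, size_t n)
{
    size_t hits = 0;

    assert(batch != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (batch[i].fd == batch[0].fd)
            hits++;
    }
    return hits;
}
```

For a batch with descriptors 3, 4, 3, 3, 5 the hint becomes 3 rather than 5, so two tags are no longer requested on behalf of devices that would never use them.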

v4: io_uring: Pass the whole sqe to commands

These three patches prepare for the sock support in the io_uring cmd, as described in the following RFC:

Since the support linked above depends on other refactors, such as the sock ioctl() sock refactor, I would like to start integrating patches that have consensus and can bring value right now. This will also reduce the patchset size later.

These three patches are simple changes that make the io_uring cmd subsystem more flexible (by passing the whole SQE to the command) and clean up an unnecessary compile check.

These patches were tested by creating a file system and mounting an NVMe disk using ubdsrv/ublkb0.

v12: io_uring: add napi busy polling support

This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id’s that are currently enabled for busy polling. This list is used to determine which napi id’s enabled busy polling. For faster access it also adds a hash table.

When a new napi id is added, the hash table is used to locate if the napi id has already been added. When processing the busy poll loop the list is used to process the individual elements.
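The pairing of a list for iteration with a hash table for duplicate checks can be sketched in userspace as follows. This is not the io_uring implementation: `struct napi_tracker`, its fixed capacity, and the modulo hash are all assumptions chosen to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NAPI_BUCKETS 16 /* hypothetical hash table size */
#define NAPI_MAX     64 /* hypothetical capacity */

struct napi_entry {
    unsigned int id;
    int next; /* index of the next entry in the same bucket, -1 ends it */
};

struct napi_tracker {
    struct napi_entry entries[NAPI_MAX]; /* the list walked by the poll loop */
    size_t count;
    int buckets[NAPI_BUCKETS];           /* chain heads for duplicate checks */
};

static void napi_tracker_init(struct napi_tracker *t)
{
    assert(t != NULL);
    t->count = 0;
    for (size_t i = 0; i < NAPI_BUCKETS; i++)
        t->buckets[i] = -1;
}

/* Returns true if id was newly added, false if present or out of space. */
static bool napi_tracker_add(struct napi_tracker *t, unsigned int id)
{
    size_t b = id % NAPI_BUCKETS;

    assert(t != NULL);
    /* Hash lookup: only the entries in one bucket are compared. */
    for (int i = t->buckets[b]; i != -1; i = t->entries[i].next) {
        if (t->entries[i].id == id)
            return false;
    }
    if (t->count == NAPI_MAX)
        return false;
    t->entries[t->count].id = id;
    t->entries[t->count].next = t->buckets[b];
    t->buckets[b] = (int)t->count;
    t->count++;
    return true;
}
```

A busy-poll loop would then walk `entries[0..count)` in order, while additions only pay for a short bucket scan.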

io-uring allows specifying two parameters:

  • busy poll timeout and
  • prefer busy poll

to the call of io_napi_busy_loop(). This sets the above parameters for the ring. The settings are passed with a new structure io_uring_napi.

There is also a corresponding liburing patch series, which enables this feature. The name of the series is “liburing: add add api for napi busy poll timeout”. It also contains two programs to test this.

Testing has shown that the round-trip times are reduced to 38us from 55us by enabling napi busy polling with a busy poll timeout of 100us. More detailed results are part of the commit message of the first patch.

v1: io_uring: undeprecate epoll_ctl support

Libuv recently started using it so there is at least one consumer now.

v1: Rethinking splice

IORING_OP_SPLICE has problems, many of them fundamental and rooted in the uapi design; see the patch 8 description. This patchset introduces a different approach, which came from discussions about splices and fused commands and absorbed ideas from both of them. We remove the reliance on pipes and instead register “spliced” buffers with data as io_uring registered buffers. The user can then use such a buffer as a usual registered buffer, e.g. pass it to IORING_OP_WRITE_FIXED.

Once a buffer is released, it’ll be returned back to the file it originated from via a callback. This is carried out at the level of the entire buffer rather than on a per-page basis as with splice, which, as noted by Ming, will allow more optimisations.

The communication with the target file is done by a new fops callback, however the eventual means of getting a buffer might change. It also peels layers of code compared to splice requests, which helps it to be more flexible and support more cases. For instance, Ming has a case where it’s beneficial for the target file to provide a buffer to be filled with read/recv/etc. requests and then returned back to the file.

v1: io_uring attached nvme queue

This series shows one way to do what the title says.

This puts up a more direct/lean path that enables

  • submission from io_uring SQE to NVMe SQE
  • completion from NVMe CQE to io_uring CQE

Essentially cutting the hoops (involving request/bio) for the nvme io path.

Also, io_uring ring is not to be shared among application threads. Application is responsible for building the sharing (if it feels the need). This means ring-associated exclusive queue can do away with some synchronization costs that occur for shared queue.

The primary objective is to amp up the efficiency of the kernel io path further (towards PCIe gen N, N+1 hardware). And we are seeing some asks too [1].

Rust For Linux

v2: rust: str: add conversion from CStr to CString

These methods can be used to copy the data in a temporary c string into a separate allocation, so that it can be accessed later even if the original is deallocated.

The API in this change mirrors the standard library API for the &str and String types. The ToOwned trait is not implemented because it assumes that allocations are infallible.

v1: Rust null block driver

A null block driver is a good opportunity to evaluate Rust bindings for the block layer. It is a small and simple driver and thus should be simple to reason about. Further, the null block driver is not usually deployed in production environments. Thus, it should be fairly straightforward to review, and any potential issues are not going to bring down any production workloads.

v1: rust: error: add ERESTARTSYS error code

This error code was probably excluded here originally because it never actually reaches user programs when a syscall returns it. However, from the perspective of a kernel driver, it is still a perfectly valid error type, that the driver might need to return. E.g., this can be necessary when a signal occurs during sleep.

v1: rust: error: allow specifying error type on Result

Currently, if the kernel::error::Result type is in scope (which it often is, since it’s in the kernel’s prelude), you cannot write Result<T, SomeOtherErrorType> when you want to use a different error type than kernel::error::Error.

To solve this we change the error type from being hard-coded to just being a default generic parameter. This still lets you write Result<T> when you just want to use the Error error type, but also lets you write Result<T, SomeOtherErrorType> when necessary.

BPF

v2: bpf-next: bpftool: Support bpffs mountpoint as pin path for prog loadall

Currently, when using prog loadall, if the pin path is a bpffs mountpoint, bpffs will be repeatedly mounted to the parent directory of the bpffs mountpoint path.

For example,

$ bpftool prog loadall test.o /sys/fs/bpf

currently bpffs will be repeatedly mounted to /sys/fs.

v3: bpf-next: Dynptr Verifier Adjustments

These patches relax a few verifier requirements around dynptrs.

Patches 1-3 are unchanged from v2, apart from rebasing.
Patch 4 is the same as in v1, see https://lore.kernel.org/bpf/CA+PiJmST4WUH061KaxJ4kRL=fqy3X6+Wgb2E2rrLT5OYjUzxfQ@mail.gmail.com/
Patch 5 adds a test for the change in Patch 4.

v1: bpf: netdev: init the offload table earlier

Some netdevices may get unregistered before late_initcall(), we have to move the hashtable init earlier.

v1: bpf-next: RFC: bpf: query effective progs without cgroup_mutex

We’re observing some stalls on the heavily loaded machines in the cgroup_bpf_prog_query path. This is likely due to being blocked on cgroup_mutex.

IIUC, the cgroup_mutex is there mostly to protect the non-effective fields (cgrp->bpf.progs) which might be changed by the update path. For the BPF_F_QUERY_EFFECTIVE case, all we need is to rcu_dereference a bunch of pointers (and keep them around for consistency), so let’s do it.

Sending out as an RFC because it looks a bit ugly. It would also be nice to handle non-effective case locklessly as well, but it might require a larger rework.

v3: bpf-next: Add precision propagation for subprogs and callbacks

This patch set teaches BPF verifier to support SCALAR precision backpropagation across multiple frames (for subprogram calls and callback simulations) and addresses most practical situations (SCALAR stack loads/stores using registers other than r10 being the last remaining limitation, though thankfully rarely used in practice).

v4: bpf-next: bpf: Don’t EFAULT for {g,s}setsockopt with wrong optlen

optval larger than PAGE_SIZE leads to EFAULT if the BPF program isn’t careful enough. This is often overlooked and might break completely unrelated socket options. Instead of EFAULT, let’s ignore BPF program buffer changes. See the first patch for more info.

In addition, clearly document this corner case and reset optlen in our selftests (in case somebody copy-pastes from them).

v3: net: bonding: add xdp_features support

Introduce xdp_features support for bonding driver according to the slave devices attached to the master one. xdp_features is required whenever we want to xdp_redirect traffic into a bond device and then into selected slaves attached to it.

v1: bpf-next: bpf_refcount followups (part 1)

This series is the first of two (or more) followups to address issues in the bpf_refcount shared ownership implementation discovered by Kumar. Specifically, this series addresses the “bpf_refcount_acquire on non-owning ref in another tree” scenario described in [0], and does not address issues raised in [1]. Further followups will address the other issues.

v7: bpf-next: bpf: Add socket destroy capability

This patch adds the capability to destroy sockets in BPF. We plan to use the capability in Cilium to force client sockets to reconnect when their remote load-balancing backends are deleted. The other use case is on-the-fly policy enforcement where existing socket connections prevented by policies need to be terminated.

[RFC/PATCH] libbpf: Store zero fd to fd_array for loader kfunc relocation

When moving some of the test kfuncs to bpf_testmod I hit an issue when some of the object’s kfuncs are in module and some in vmlinux.

The problem is that both vmlinux and module kfuncs get btf_fd_idx index into fd_array, but we store to it the BTF fd value only for module’s kfunc.

Then after the program is loaded we check if fd_array[btf_fd_idx] != 0 and close the fd.

When the object has kfuncs from both vmlinux and module, the fd from fd_array[btf_fd_idx] from previous load will be there for vmlinux kfunc and we close unrelated fd (of the program we just loaded in my case).

Not sure if there’s easier way to clear the fd_array between the loads, but the change below seems to fix the issue for me.

v1: bpf-next: Centralize BPF permission checks

This patch set refactors BPF subsystem permission checks for BPF maps and programs, localizes them in one place, and ensures all parts of BPF ecosystem (BPF verifier and JITs, and their supporting infra) use recorded effective capabilities, stored in respective bpf_map or bpf_prog structs, for further decision making.

This allows for more explicit and centralized handling of BPF-related capabilities and makes for simpler further BPF permission model evolution, to be proposed and discussed in follow up patch sets.

v1: bpf-next: bpf: Emit struct bpf_tcp_sock type in vmlinux BTF

In one of our internal tests, we found a case where

  • uapi struct bpf_tcp_sock is in vmlinux.h where vmlinux.h is not generated from the testing kernel
  • struct bpf_tcp_sock is not in vmlinux BTF

The above combination caused bpf load failure as the following memory access

struct bpf_tcp_sock *tcp_sock = …;
…
tcp_sock->snd_cwnd
…

needs CORE relocation but the relocation cannot be resolved since the kernel BTF does not have corresponding type.

Similar to other previous cases (nf_conn___init, tcp6_sock, mptcp_sock, etc.), add the type to vmlinux BTF with the BTF_TYPE_EMIT macro.

v9: tracing: Add fprobe/tracepoint events

With these fprobe events, we can continue to trace function entry/exit even if CONFIG_KPROBES_ON_FTRACE is not available. Since CONFIG_KPROBES_ON_FTRACE requires CONFIG_DYNAMIC_FTRACE_WITH_REGS, it is not available if the architecture only supports CONFIG_DYNAMIC_FTRACE_WITH_ARGS (e.g. arm64). That means kprobe events cannot probe function entry/exit effectively on such architectures. But this problem can be solved if dynamic events support fprobe events, because fprobe events don’t use kprobes but ftrace via fprobe.

v3: bpf-next: Handle immediate reuse in bpf memory allocator

As discussed in v1, freed objects in the bpf memory allocator may currently be reused immediately by a new allocation. This introduces a use-after-bpf-ma-free problem for non-preallocated hash maps and makes the lookup procedure return incorrect results. The immediate reuse also makes introducing new use cases more difficult (e.g. qp-trie).

The patch series tries to solve these problems by introducing BPF_MA_{REUSE|FREE}_AFTER_RCU_GP in the bpf memory allocator. For REUSE_AFTER_GP, the freed objects are reused only after one RCU grace period and may be freed by the bpf memory allocator after another RCU-tasks-trace grace period. So bpf programs which care about the reuse problem can use bpf_rcu_read_{lock,unlock}() to access these objects safely, and for those which don’t care, the use-after-bpf-ma-free is safe because these objects have not been freed by the bpf memory allocator. FREE_AFTER_GP behaves differently. Instead of making the freed elements reusable after one RCU GP, it directly frees these elements back to slab after one RCU GP, so sleepable bpf programs must use bpf_rcu_read_{lock,unlock}() to access elements allocated from a FREE_AFTER_GP bpf memory allocator.

Personally I prefer FREE_AFTER_RCU_GP because its implementation is much simpler compared with REUSE_AFTER_RCU and its memory usage is also better than REUSE_AFTER_GP. But its shortcoming is also obvious, so I want to get some feedback before putting in more effort. As usual, comments and suggestions are always welcome.

Related technology updates

Qemu

v2: Add RISC-V KVM AIA Support

This series adds support for KVM AIA in RISC-V architecture.

In order to test these patches, we require Linux with KVM AIA support, which can be found in the qemu_kvm_aia branch at https://github.com/yong-xuan/linux.git. This kernel branch is based on the riscv_aia_v1 branch available at https://github.com/avpatel/linux.git, and it also includes two additional patches that fix a KVM AIA bug and reply to the query of KVM_CAP_IRQCHIP.

v1: riscv-to-apply queue

First RISC-V PR for 8.1

  • CPURISCVState related cleanup and simplification
  • Refactor Zicond and reuse in XVentanaCondOps
  • Fix invalid riscv,event-to-mhpmcounters entry
  • Support subsets of code size reduction extension
  • Fix itrigger when icount is used
  • Simplification for RVH related check and code style fix
  • Add signature dump function for spike to run ACT tests
  • Rework MISA writing
  • Fix mstatus.MPP related support
  • Use check for relationship between Zdinx/Zhinx{min} and Zfinx
  • Fix the H extension TVM trap
  • A large collection of mstatus sum changes and cleanups
  • Zero init APLIC internal state
  • Implement query-cpu-definitions
  • Restore the predicate() NULL check behavior
  • Fix Guest Physical Address Translation
  • Make sure an exception is raised if a pte is malformed
  • Add Ventana’s Veyron V1 CPU

v3: linux-user: Add /proc/cpuinfo handler for RISC-V

v1: tcg/riscv: Support for Zba, Zbb, Zicond extensions

Based-on: 20230503070656.1746170-1-richard.henderson@linaro.org (“v4: tcg: Improve atomicity support”)

I’ve been vaguely following the __hw_probe syscall progress in the upstream kernel. The initial version only handled bog standard F+D and C extensions, which everything expects to be present anyway, which was disappointing. But at least the basis is there for proper extensions.

In the meantime, probe via sigill. Tested with qemu-on-qemu. I understand the Ventana core has all of these, if you’d be so kind as to test.

U-Boot

v3: SPL NVMe support

This patchset adds support for loading images of the SPL’s next booting stage from an NVMe device.

v2: SPL NVme support

This patchset adds support for loading images of the SPL’s next booting stage from an NVMe device.


